From 0372856f653d6332919a24c277fd2fe16dedcced Mon Sep 17 00:00:00 2001
From: "Gao, Xiang"
Date: Fri, 30 Sep 2022 00:21:00 -0700
Subject: [PATCH] Rebase #1900 (#2009)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* hash update - bug fix for branches (#83865)

Hash updates for xla were failing because the current pinned hash is a branch, so the git command for getting the date couldn't find the branch due to not having a local version of the branch. Fixed by checking out the branch to make sure it exists locally.

Example of failure: https://github.com/pytorch/pytorch/runs/7913835742?check_suite_focus=true

Test plan: made it a pull request trigger and ran it, to get this: https://github.com/pytorch/pytorch/runs/7959221184?check_suite_focus=true

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83865
Approved by: https://github.com/zengk95

* [FSDP] Remove unneeded checks (#83150)

@awgu pointed out these checks aren't really doing anything, as they just make sure we're setting training state in certain ways throughout FSDP, which is sort of arbitrary. So, removing them to avoid confusion.

We still keep the checking around `_post_backward_called` because this is needed in `finalize_params` for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83150
Approved by: https://github.com/awgu

* [BE] Revert distributed change in https://github.com/pytorch/pytorch/pull/68779 (#83181)

https://github.com/pytorch/pytorch/issues/82641 points out a regression in how inputs / outputs are processed by DDP, blocking their HF use case. It was narrowed down to https://github.com/pytorch/pytorch/pull/68779, and reverting the distributed change there fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83181
Approved by: https://github.com/kumpera

* Transpose scheduler small dim sizes better support (#1910)

* Optimize transpose copy on CPU using fbgemm transpose (#83327)

### Description
Optimize transpose copy on CPU using fbgemm transpose.

### Testing
single socket (28 cores):
```
before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 4.819e-05 ms; bf16: 4.846e-05 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000171 ms; bf16: 0.000129 ms

after:  torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 2.439e-05 ms; bf16: 2.152e-05 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000132 ms; bf16: 3.916e-05 ms
```

single core:
```
before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.00109 ms; bf16: 0.00103 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00339 ms; bf16: 0.00295 ms

after:  torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.000566 ms; bf16: 0.000382 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00282 ms; bf16: 0.000999 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83327
Approved by: https://github.com/frank-wei

* Grouped grid welford (#1921)

Enables grouping of grid welford ops across iterations. Same functionality as the iteration grouping for GridReduction. This is intended to improve the outer-norm grid persistence in batchnorm-like fusions.

* [ONNX] Use `errors.SymbolicValueError` for more context (#83332)

Replace runtime errors in torch.onnx with `errors.SymbolicValueError` for more context around jit values.

- Extend `_unimplemented`, `_onnx_unsupported`, `_onnx_opset_unsupported`, `_onnx_opset_unsupported_detailed` errors to include JIT value information
- Replace plain RuntimeError with `errors.SymbolicValueError`
- Clean up: Use `_is_bool` to replace string comparison on jit types
- Clean up: Remove the todo `Remove type ignore after #81112` #77316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83332
Approved by: https://github.com/AllenTiTaiWang, https://github.com/thiagocrepaldi, https://github.com/BowenBao

* [quant][fx] Add support for quantized matmul (#83885)

Summary: att, probably missed the op during migration to the reference flow

Test Plan: python test/test_quantization.py TestQuantizeFxOps.test_qmatmul

Reviewers:
Subscribers:
Tasks:
Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83885
Approved by: https://github.com/andrewor14

* Misc fixes/tuning for transpose scheduler (#1912)

* [nn] split rnn_utils test from test_nn.py (#83675)

Ref: https://github.com/pytorch/pytorch/issues/63085

Proposed folder structure:
```
-> test
  -> nn
    -> test_conv.py
    -> test_pooling.py
    -> .....
```

This PR: Moves test related RNN utilities to a different file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83675
Approved by: https://github.com/albanD

* [optim] rprop: handle complex params as independent real params (#83858)

Ref #65711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83858
Approved by: https://github.com/albanD

* [xla hash update] update the pinned xla hash (#83899)

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83899
Approved by: https://github.com/pytorchbot

* [ROCm] More Sparse UTs enablement and more hipification mappings. (#78939)

Enables:
test_bmm_cuda_float64
test_bmm_deterministic_cuda_float64
test_csr_matvec_cuda_complex128
test_csr_matvec_cuda_complex64
test_csr_matvec_cuda_float32
test_csr_matvec_cuda_float64

To enable the above tests, some more hip mappings had to be added for the hipification process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78939
Approved by: https://github.com/pruthvistony, https://github.com/malfet

* Normalize DLPack stride to 1 where shape < 2 (#83158)

Fixes #83069. Also move all the dlpack tests to a new file, `test_dlpack.py`.

The fix involves always allocating a "strides" int array when converting to DLPack and deleting the strides when the capsule destructor is called. Then the strides are copied from the tensor, and `strides[i]` is set to `1` where `shape[i] < 2`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83158
Approved by: https://github.com/ezyang

* Remove DBR quantization from the codebase (#83642)

Summary: DBR quantization is a no-go for now because it does not align well with PyTorch 2.0 plans and we do not want to build yet another tracing system. Deleting it from the codebase for now since there are no plans to develop this in the near future. We can bring it back at a later time if necessary.

Test plan: CI

Differential Revision: [D38839556](https://our.internmc.facebook.com/intern/diff/D38839556)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83642
Approved by: https://github.com/andrewor14, https://github.com/jerryzh168

* Refactored ops on size to be dispatcher ops (#83719)

An example of how the graph looks now.
``` def forward(self, x_1): size = torch.ops.math.size(x_1, 0) size_1 = torch.ops.math.size(x_1, 1); x_1 = None ones = torch.ops.aten.ones.default([1], device = device(type='cpu'), pin_memory = False) expand_sym_int = torch.ops.aten.expand.SymInt(ones, [size, size_1]); ones = size = size_1 = None cos_default = torch.ops.aten.cos.default(expand_sym_int); expand_sym_int = None return (cos_default,) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83719 Approved by: https://github.com/ezyang * Fix stride issue with faketensors (#83822) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83822 Approved by: https://github.com/ezyang, https://github.com/ngimel * Nullary RNGOp (#1892) * [ROCm] restore MIOpen benchmark flag default to true (#82656) ### Description PR https://github.com/pytorch/pytorch/pull/77438 allowed MIOpen to support the benchmark flag. Previously, the benchmark flag was ignored by MIOpen such that benchmarking was always turned on. This commit restores the behavior that MIOpen benchmarking is by default turned on. ### Testing CI unit tests cover this capability. Torchvision models demonstrate the performance delta. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82656 Approved by: https://github.com/ngimel * Update retry action to latest version (#83911) We're running into EPERM issues when trying to install nvidia tools, see failure example https://github.com/pytorch/pytorch/runs/7975726013?check_suite_focus=true. ``` WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver. /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1049 throw err; ^ Error: kill EPERM at process.kill (internal/process/per_thread.js:199:13) at killPid (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1059:17) at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1036:21 at Array.forEach () at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1034:23 at Array.forEach () at killAll (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1033:27) at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1024:13 at ChildProcess.onClose (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1080:17) at ChildProcess.emit (events.js:314:20) { errno: 'EPERM', code: 'EPERM', syscall: 'kill' } ``` The root issue probably lies elsewhere but this action is not helping/the errors seem to say it's unable to kill child processes. A more recent commit in that repo uses spawn instead of exec which might make a difference. Regardless, we should keep our actions up to date anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83911 Approved by: https://github.com/malfet * [PyTorch] Remove unused sstream/string includes from c10/macros/Macros.h (#83353) Nothing in the rest of the header seems to use these. 
Differential Revision: [D38672680](https://our.internmc.facebook.com/intern/diff/D38672680/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83353 Approved by: https://github.com/malfet * [functorch] add linalg cross batch rule (#83759) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83759 Approved by: https://github.com/zou3519 * Improve DistanceKernel.cu (#83811) include device_sqrt replace reduce_agg by BlockReduce choose implementation by impl_fptr instead of error-prone copy-and-paste Pull Request resolved: https://github.com/pytorch/pytorch/pull/83811 Approved by: https://github.com/ngimel * reinplace pass: bugfix for output node replacement (#83845) Cleaned up some of the arg replacement logic to use tree_map, so it handles FX nodes that have nested containers. See the added test: when you write a function that returns a list, the `output` node in the FX graph shows up as having `node.args = tuple(immutable_list(...))` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83845 Approved by: https://github.com/ezyang * reinplace pass: special handling for view_scatter ops (#83846) There is already special handling in the reinplacing pass for removing `{view}_scatter` ops, but there is another case that needs special handling. In this code: ``` def f(): a = torch.zeros(4, 4, 4) a[:, 2:] = torch.ones(4, 2, 4) return a ``` Tracing normally with `make_fx()` gives you: ``` def forward(self): zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False) ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False) slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807) slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807); slice_tensor = None copy__default = torch.ops.aten.copy_.default(slice_tensor_1, ones); slice_tensor_1 = ones = None return zeros ``` Functionalizing it gives you: ``` def forward(self): zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False) ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False) slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807) slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807); slice_tensor = None slice_tensor_2 = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807) slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, 9223372036854775807); slice_tensor_2 = ones = None slice_scatter_default_1 = torch.ops.aten.slice_scatter.default(zeros, slice_scatter_default, 0, 0, 9223372036854775807); zeros = slice_scatter_default = None return slice_scatter_default_1 ``` Notice that there are not any functional ops to directly re-inplace! What actually happened is that functionalization turned the `copy_()` into a `copy()`, but the out-of-place `copy()` operator gets optimized away because it's a no-op (when the input and output metadata are the same, `out = copy(a, b)` just returns `b`). What we actually want is to replace this line: ``` slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, ...); ``` with this: ``` new_slice = torch.ops.aten.slice.Tensor(slice_tensor_2, 1, 2, ...); _ = torch.ops.aten.copy_.default(new_slice, ones) ``` In the above, we're taking a fresh slice of the "base" tensor, and performing a `copy_()` on the slice, adding back what functionalization removed. 
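For intuition, the slice + `copy_()` pair described above is just the ordinary eager-mode pattern of writing through a view of the base tensor. A minimal sketch of the intended semantics (an illustration only, not code from the re-inplacing pass):

```python
import torch

# Writing through a slice view mutates the base tensor; this is the
# mutation that the slice + copy_() replacement restores in the graph.
a = torch.zeros(4, 4, 4)
new_slice = a[:, 2:]                  # corresponds to aten.slice.Tensor
new_slice.copy_(torch.ones(4, 2, 4))  # corresponds to aten.copy_.default
assert torch.equal(a[:, 2:], torch.ones(4, 2, 4))
```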
We actually need to create a fresh "slice" node, because we're not guaranteed that one already exists in the graph (technically there should be one, but it might have been DCE'd by the time we hit re-inplacing).

I also updated the docs for re-inplacing to more closely match the order of the logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83846
Approved by: https://github.com/ezyang

* Move ATenNVRTC.h include from `jit_utils.h` to `jit_utils.cpp` (#83886)

In general, `.h` files should only include headers that are used in the header.

Fixes https://github.com/pytorch/pytorch/issues/83856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83886
Approved by: https://github.com/ngimel

* Allow None arguments for elementwise type promotion wrapper and fix clamp with None arguments (#83586)

Fixes https://github.com/pytorch/torchdynamo/issues/759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83586
Approved by: https://github.com/ezyang, https://github.com/ngimel

* Enable NCCL_DESYNC_DEBUG when TORCH_DISTRIBUTED_DEBUG=DETAIL (#83881)

Automatically enable `NCCL_DESYNC_DEBUG` when `TORCH_DISTRIBUTED_DEBUG` is set to `DETAIL`, saving the user from setting two env variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83881
Approved by: https://github.com/malfet, https://github.com/rohan-varma, https://github.com/H-Huang

* Strenghten preconditions of linalg.cross (#83798)

This makes `linalg.cross` array API compliant (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs.

Fixes https://github.com/pytorch/pytorch/issues/77629
Fixes https://github.com/pytorch/pytorch/issues/83756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798
Approved by: https://github.com/mruberry

* Fix view_func replay in no-grad mode (#83872)

Fixes https://github.com/pytorch/pytorch/issues/83828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83872
Approved by: https://github.com/albanD

* [vulkan] Add VMA as a third_party subrepo (#83906)

The [VulkanMemoryAllocator](https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator) is a popular library for GPU memory allocation using Vulkan. The Vulkan backend has a dependency on it, but since it is only a single header file we currently include it by checking it into the repo under [aten/src/ATen/native/vulkan/api/vk_mem_alloc.h](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h). However, it is better to check it in as a third party submodule, since it allows better version tracking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83906
Approved by: https://github.com/kimishpatel

* [torchgen] Add documentation for `autogen` keyword (#83610)

This is a follow up for #81437. This PR explains which operators can use `autogen` and what will be generated. Also talked about generated kernels and where to find them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83610
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* remove assertEqualIgnoreTypes from test/distributions/test_distributions.py (#83709)

See https://github.com/pytorch/pytorch/issues/38095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83709
Approved by: https://github.com/kit1980

* [fix] edge case in `MaxPool1d` and add ErrorInputs (#83553)

Fixes #83224

cc @kshitij12345 @albanD!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83553
Approved by: https://github.com/albanD

* [complex] conv_transpose1d (#79694)

Reference: https://github.com/pytorch/pytorch/issues/71108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79694
Approved by: https://github.com/ngimel

* Revert "Strenghten preconditions of linalg.cross (#83798)"

This reverts commit 7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e.

Reverted https://github.com/pytorch/pytorch/pull/83798 on behalf of https://github.com/janeyx99 due to Sorry, land race caused functorch issues https://hud.pytorch.org/pytorch/pytorch/commit/7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e

* Fix load_extra_only api for flatbuffers and enable flatbuffers in mobile for OSS properly (#83855)

The `_load_extra_only_for_mobile` API hasn't handled the flatbuffers logic yet. Update the API accordingly. Also found out that the mobile build in OSS doesn't build with flatbuffers; filed task T129996445 to track this.

Differential Revision: [D38890847](https://our.internmc.facebook.com/intern/diff/D38890847/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38890847/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83855
Approved by: https://github.com/qihqi

* Prefer signal from land checks over PR signals (#83715)

# The problem

When a dev forks their branch from a red master build, their branch can fail CI checks for reasons unrelated to their changes, but the same checks would however pass in the land validation commit (which is rebased off of viable/strict).

Today, in the above scenario the `merge -l` command fails because mergebot sees the failing checks in the PR, which is not helpful when that same check passes in land validation.

# The solution

This PR changes the behavior so that:
1. If both the PR and land validation ran a workflow, only look at the results from land validation
2. If only the PR ran a specific workflow (e.g. for CLA Check or a nightly run) then continue to look at the result from the PR (which matches existing behavior)

### Bonus fixes

It also includes a few extra BE fixes:
- Replaces the tuple we used to pass workflow check results around with a named tuple so that it's easier to tell what data is being used
- Reduces the number of API calls to github by ~50% during merges. Before, we were pulling results from github every time and then filtering them down to the relevant category of checks (e.g. failed/pending/startup_failed). Now, our filters share the check results

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83715
Approved by: https://github.com/zengk95

* Don't introduce new overload for SymInt (#83628)

Previously, we introduced new SymInt overloads for every function we wanted. This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and change its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA; see the companion PR https://github.com/pytorch/xla/pull/3914

Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., at::empty(IntArrayRef, ...)). To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types); as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against the true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`). This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of the `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use the `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpreted the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload).
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* Remove CoreMLMemoryObserver (#83703)

Summary: We added this observer to help us diagnose memory issues that have since been resolved. It should be safe to clean this up.

Test Plan: Diff just removed logging, so just build IG and confirm no errors.

Differential Revision: D38843701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83703
Approved by: https://github.com/mcr229

* ci: Remove dead code related to android uploads (#83930)

These uploads actually never got triggered in nightlies, so removing it altogether. Someone can re-add it in the future if they feel these are important, but I can't find an instance of this running since we migrated, so I have a hard time believing anyone will miss it.
https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=android Signed-off-by: Eli Uriegas Pull Request resolved: https://github.com/pytorch/pytorch/pull/83930 Approved by: https://github.com/atalman, https://github.com/malfet * [fx][pass infra] Adding error catching (#83933) Example: ``` ====================================================================== ERROR: test_pass_manager_error (fx.test_pass_infra.TestPassManager) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 285, in __call__ res = fn(module) File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 164, in pass_fail raise RuntimeError("bad") RuntimeError: bad The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 170, in test_pass_manager_error pm(traced_m) File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 289, in __call__ raise RuntimeError(msg) from e RuntimeError: An error occured when running the 'pass_fail' pass after the following passes: ['replace_add_with_mul_pass', 'replace_mul_with_div_pass'] ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/83933 Approved by: https://github.com/SherlockNoMad * Back out "Support regex-style matching for Any and Oneof (#82853)" (#83922) Reviewed By: hl475 Differential Revision: D38945806 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83922 Approved by: https://github.com/hl475 * Fix use-dict-literal lint (#83718) Fix use-dict-literal pylint suggestions by changing `dict()` to `{}`. This PR should do the change for every Python file except test/jit/test_list_dict.py, where I think the intent is to test the constructor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83718 Approved by: https://github.com/albanD * Revert "Optimize transpose copy on CPU using fbgemm transpose (#83327)" This reverts commit 04d8da88a6a1abf0da2b11096c85244bf38d3b2a. Reverted https://github.com/pytorch/pytorch/pull/83327 on behalf of https://github.com/weiwangmeta due to breaking internal builds/causing out-of-bounds errors/training accuracy * Add hypothesis to requirements.txt (#83740) Signed-off-by: Edward Z. 
Yang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83740
Approved by: https://github.com/zhxchen17, https://github.com/janeyx99, https://github.com/zou3519

* [fbia] Keep Track of full qualified name before and after remote sharding (#83889)

Summary: track qualname changes in embedding sharding & FX split, and compose target qualname in the end of the FBIA transform stage, so we can use the qualname mapping in the XL materialize stage.

Test Plan: CI/CD

with DISABLE_XLEBB_MATERIALIZATION = True https://fburl.com/fblearner/a8yljbux

with DISABLE_XLEBB_MATERIALIZATION = False https://fburl.com/fblearner/2nvi0dam

Reviewed By: lliu315gt

Differential Revision: D38772525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83889
Approved by: https://github.com/houseroad

* add merge blocking to ci: sev template (#83940)

As in title, so that by default a ci: sev will block merges. The line can be removed to not block merges.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83940
Approved by: https://github.com/huydhn, https://github.com/janeyx99, https://github.com/malfet, https://github.com/seemethere

* Move nnapi code from ATen common code to specific library (#83748)

Summary: Currently we include nnapi code in all targets using ATen even if it's not used (actually there is no usage and it is being deprecated). Move it to `nnapi_backend_lib` for now.

Test Plan: Sandcastle.

Differential Revision: D38761095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83748
Approved by: https://github.com/salilsdesai, https://github.com/SS-JIA

* Task: T129772171 remove assertEqualIgnoreTypes from test/test_nn.py (#83870)

See https://github.com/pytorch/pytorch/issues/38095. Replaced assertEqualIgnoreType with assertEqual.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83870
Approved by: https://github.com/kit1980

* [Nested Tensor] Make offset copy and move assignment more explicit. (#83488)

Currently the nested tensor construction for the offset_ parameter takes in references and, in the chain of delegation, uses value. This could lead to unnecessary copies. Whenever a nested tensor impl is constructed it should take ownership of all its metadata. The only non-trivially copyable metadata associated with the class is `offsets_`.

The goal of this PR is to make sure that consumers of nested_tensor_impl constructors ensure that they are passing offsets as a temporary - either by explicitly copying a reference, or by constructing the offsets vector in the scope of construction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83488
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* Remove conj kernels for real dtypes (#80374)

`conj_physical_stub` is currently implemented for all dtypes despite it just being a plain copy for real dtypes. So, instead we should defer to the existing copy kernel in these cases.

On my build for one CUDA architecture, I see a 2.2 MB decrease in `libtorch_cuda.so` size.
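As a quick sanity check of the claim that conjugation is a plain copy for real dtypes, here is a small hedged sketch (an illustration only, not code from the PR):

```python
import torch

# For real dtypes the physical conjugate equals the input, so deferring
# to the copy kernel is observationally equivalent.
x = torch.randn(3, 4)  # float32
assert torch.equal(torch.conj_physical(x), x)

# Complex dtypes still need a real kernel: conjugation negates the imaginary part.
z = torch.randn(3, 4, dtype=torch.complex64)
assert torch.equal(torch.conj_physical(z), torch.complex(z.real, -z.imag))
```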
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80374 Approved by: https://github.com/ngimel, https://github.com/atalman * [BE][CUDA] Use packed_accessor64 (#83949) Not sure why we are ignoring those, but SoftMax.cu alone generates 100+ lines of warnings: ``` /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In function ‘at::Tensor at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::get_offsets(const at::Tensor&, const IntArrayRef&, int64_t)’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:261:69: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = long int; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto indices_accessor = indices.packed_accessor(); ^ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = double; bool LogSoftMax = false; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:607:924: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = float; bool LogSoftMax = false; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:607:1677: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead 
[-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = double; bool LogSoftMax = true; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:623:927: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = float; bool LogSoftMax = true; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:623:1679: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here 
GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = double; bool LogSoftMax = false; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:641:977: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto grad_values_accessor = grad_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = float; bool LogSoftMax = false; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:641:1775: required from here 
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto grad_values_accessor = grad_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = double; bool LogSoftMax = true; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:661:980: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here 
GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto grad_values_accessor = grad_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = float; bool LogSoftMax = true; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:661:1777: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto grad_values_accessor = grad_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = double; bool requireMxRows = true; at::IntArrayRef = c10::ArrayRef; int64_t = long int]’: /tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:16:557: required from here 
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = float; bool requireMxRows = true; at::IntArrayRef = c10::ArrayRef; int64_t = long int]’: /tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:18:556: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = double; bool requireMxRows = false; at::IntArrayRef = c10::ArrayRef; int64_t = long int]’: /tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:20:557: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = float; bool requireMxRows = false; at::IntArrayRef = c10::ArrayRef; int64_t = long int]’: /tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:21:556: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = 
^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
GenericPackedTensorAccessor packed_accessor() const & {
^ ~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83949
Approved by: https://github.com/ngimel

* Support returning symbolic strides from t.stride() in Python (#83842)

Signed-off-by: Edward Z. Yang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83842
Approved by: https://github.com/albanD, https://github.com/Chillee, https://github.com/bdhirsh

* Support the XPU backend untyped storage (#83952)

Simply adds the XPU backend to the untyped torch storage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83952
Approved by: https://github.com/ezyang

* Support NCCL Premul Sum (#81272)

This PR adds the support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum.

The major changes include
- convert enum ReduceOp to struct
- add premul sum specific paths to init.cpp and Ops.cpp.

note:
- For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed

The commit titled "add nccl premul" whose current hash is https://github.com/pytorch/pytorch/pull/81272/commits/cb99ad67447b5899ecf8c4c3d78deaafa1cc09b8 was authored by @mcarilli and @ptrblck.

cc @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272
Approved by: https://github.com/kwen2501

* Test type promotion assertignoretypes (#83867)

See #38095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83867
Approved by: https://github.com/kit1980, https://github.com/mruberry

* [Profiler] record nn.Module's parameters (#83209)

Summary: Record nn.Module's parameters for detailed memory profiling:
- extend 'module_' in value cache & NNModuleInfo to save parameters
- python binding and unit test case

Test Plan: buck run mode/opt //caffe2/test:profiler -- -r test_nnmodule

Differential Revision: D38379717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83209
Approved by: https://github.com/robieta

* [xla hash update] update the pinned xla hash (#83967)

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83967
Approved by: https://github.com/pytorchbot

* Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)

* Fix LTC build warnings (#83955)

Addresses the `Wc++98-compat-extra-semi` warning from https://github.com/llvm/torch-mlir/issues/1264 by removing the extraneous semicolon after autogen LTC native function definitions.

```
/home/runner/work/torch-mlir/torch-mlir/build/tools/torch-mlir/python/torch_mlir/csrc/base_lazy_backend/generated/LazyNativeFunctions.cpp:4241:6: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
};
 ^
```

cc: @wconstab @desertfire @ke1337 @antoniojkim

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83955
Approved by: https://github.com/wconstab

* Strenghten preconditions of linalg.cross (#83798)

This makes `linalg.cross` array API compliant (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs. A small usage sketch follows below.
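A hedged usage sketch of the strengthened contract (based on the array API issue linked above; the shapes here are illustrative): the inputs must broadcast against each other, and the dimension being reduced, `dim` (default `-1`), must have size exactly 3.

```python
import torch

# Cross product along the last dimension, which must have size 3;
# the remaining dimensions broadcast.
a = torch.randn(4, 3)
b = torch.randn(1, 3)            # broadcasts against `a`
out = torch.linalg.cross(a, b)   # shape (4, 3)
assert out.shape == (4, 3)

# With the stricter preconditions, a non-size-3 dimension is rejected up front.
try:
    torch.linalg.cross(torch.randn(4, 2), torch.randn(4, 2))
except RuntimeError as err:
    print("rejected:", err)
```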
Fixes https://github.com/pytorch/pytorch/issues/77629 Fixes https://github.com/pytorch/pytorch/issues/83756 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798 Approved by: https://github.com/mruberry * Make linalg.inv composite of linalg.solve (#80074) The `getri` kernel calls inside `getrs` so we can do so explicitly ourselves and save ourselves from having to maintain an extra kernel. This way we just need to optimise `lu_factor` and `lu_solve` and `inv` will be as efficient as it can be, as it'll be choosing the best backend to perform the factorisation and the best backend (not necessarily the same) to perform the solve. Fixes https://github.com/pytorch/pytorch/issues/77498 The benchmarks: https://github.com/pytorch/pytorch/pull/80074#issuecomment-1164309071 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80074 Approved by: https://github.com/IvanYashchuk, https://github.com/albanD, https://github.com/malfet * Support a stable double backward on linalg.det for real inputs (#80217) The complex case still fails. I do not know why. Fixes https://github.com/pytorch/pytorch/issues/62327 Fixes https://github.com/pytorch/pytorch/issues/53364 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80217 Approved by: https://github.com/nikitaved, https://github.com/albanD, https://github.com/malfet * [LTC] Add custom lazy tensor save function (#83294) We need a custom `save` function for checkpointing a lazy model, similar to what exists in PyTorch/XLA: https://github.com/pytorch/xla/blob/3eb8a9d9eb4ebb0b064461c3704650241625654e/torch_xla/core/xla_model.py#L994 The purpose of this function is to move any lazy tensors to CPU before saving the checkpoint. The way I implemented it was to create a general structure visitor, adapted from a function that we use quite often in Cerebras internal repositories. If there is a better tool already available in PyTorch that does the same things, I'm open to suggestions. CC: @wconstab @Krovatkin @JackCaoG Pull Request resolved: https://github.com/pytorch/pytorch/pull/83294 Approved by: https://github.com/wconstab * move pooling test from test_nn to test/nn/test_pooling (#83915) Ref #63085 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83915 Approved by: https://github.com/albanD * [ONNX] Remove static None graph output (#82623) Fixes #82370 * Unify the export behavior regarding static None outputs. These are dropped for both traced graph and TorchScript graph export. * `Optional` outputs are not affected. Fixes #82370 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82623 Approved by: https://github.com/AllenTiTaiWang, https://github.com/abock * [TorchTidy Fix] Don't try to collect strides for non-strided tensors (#83935) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83935 Approved by: https://github.com/robieta, https://github.com/slgong-fb * [WIP] Validating input_col for certain datapipes (#80267) Follow up from #79344. Currently WIP due to multiple test failures. 
Waiting for #80140 to land Pull Request resolved: https://github.com/pytorch/pytorch/pull/80267 Approved by: https://github.com/ejguan * support more symintnode operations (#83877) remove debug code Pull Request resolved: https://github.com/pytorch/pytorch/pull/83877 Approved by: https://github.com/ezyang * add arithmetic ops (#83878) arithmetic ops tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/83878 Approved by: https://github.com/ezyang * logical ops (#83879) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83879 Approved by: https://github.com/ezyang * strip SymIntNodes off in the mobile builds (#83938) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83938 Approved by: https://github.com/ezyang * [pthreadpool] Cap max thread count to fix TSAN issues (#83950) Summary: Cap the thread count to 64 unconditionally to solve this tsan issue which leads to harder to debug, flaky test failures. Test Plan: CI Reviewed By: kimishpatel Differential Revision: D38136212 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83950 Approved by: https://github.com/kimishpatel * Skip NCCL slimming for cxx11 libtorch builds (#83959) Fixes https://github.com/pytorch/pytorch/issues/83887 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83959 Approved by: https://github.com/atalman * add hud link to merge failure message (#83946) as in title, related to https://github.com/pytorch/test-infra/issues/568 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83946 Approved by: https://github.com/huydhn * Check all CUDA API calls for errors in benchmarks/cpp/nvfuser (#74920) (#81817) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74920 Test Plan: Sandcastle Differential Revision: D35194656 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81817 Approved by: https://github.com/malfet * [frontend] Fix tensor list alias annotation (#84005) For issue https://github.com/pytorch/pytorch/issues/77920 and a retry of https://github.com/pytorch/pytorch/pull/83921 The current logic checks alias info before `[]` and after. If no alias info exists after `[]`, we overwrite the alias info before. This logic failed on argument like `Tensor(a!)[]`, dropping the alias info before `[]` on the floor. This PR adds a new alias info if it's missing after `[]`. This way we can keep the alias info before `[]`. 
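A small, purely illustrative check of the schema handling described above, assuming the `torch._C.parse_schema` binding and using `_foreach_zero_` as an example of a mutable tensor-list argument:

```python
import torch

# A schema whose tensor-list argument carries an alias annotation before the brackets.
schema = torch._C.parse_schema("_foreach_zero_(Tensor(a!)[] self) -> ()")

# The (a!) write annotation on the list should survive round-tripping rather than
# being dropped on the floor.
print(schema)
```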
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84005 Approved by: https://github.com/cccclai, https://github.com/bdhirsh * Suppress Anomaly mode warning message (#83966) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83966 Approved by: https://github.com/albanD * Support BF16 for fast layernorm (#83971) Fixes #83970 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83971 Approved by: https://github.com/ngimel * Map new CUDA error handling to HIP (#75032) (#83953) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75032 Test Plan: Sandcastle Reviewed By: ezyang, malfet Differential Revision: D35253785 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83953 Approved by: https://github.com/ezyang, https://github.com/malfet * Improve Normalization.cuh (#83871) remove unused Ops replaced copy-and-paste by calling BlockReduce (+SumReduceOp +2D block indexing) and removing duplicate warpSum Pull Request resolved: https://github.com/pytorch/pytorch/pull/83871 Approved by: https://github.com/ngimel * Check all CUDA API calls for errors in test/ (#74921) (#83954) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74921 Test Plan: Sandcastle Reviewed By: ezyang, malfet, ngimel Differential Revision: D35194966 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83954 Approved by: https://github.com/ezyang * remove duplicate WarpReduceSum (#83757) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83757 Approved by: https://github.com/ngimel * Set python build-docs timeout to 30 minutes and cpp build-docs timeout to 180 minutes (#83957) Anything more means there's something wrong and we should just return. AFAIK the timeout doesn't include queuing time, only the job duration https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes ![Screen Shot 2022-08-23 at 18 31 57](https://user-images.githubusercontent.com/475357/186298046-5637384f-887c-4c6a-a946-c101b6c66741.png) This will help avoid having python build docs timeout after 6 hours. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83957 Approved by: https://github.com/ZainRizvi * [ROCm] Enable test_multiprocessing tests (#82356) Signed-off-by: Jagadish Krishnamoorthy Issue fixed in ROCm 5.2 user space. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82356 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/huydhn * Pin conda to 4.13.0 (#83991) Recent update to conda 4.14.0 caused breakages in our docker builds: https://hud.pytorch.org/pytorch/pytorch/commit/754d7f05b6841e555cea5a4b2c505dd9e0baec1d This pins to prevent the errors: ``` Traceback (most recent call last): 2022-08-24T16:20:49.2412247Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1125, in __call__ 2022-08-24T16:20:49.2413036Z File "/opt/conda/lib/python3.9/site-packages/conda/cli/main.py", line 86, in main_subshell 2022-08-24T16:20:49.2413615Z File "/opt/conda/lib/python3.9/site-packages/conda/cli/conda_argparse.py", line 93, in do_call 2022-08-24T16:20:49.2414282Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/core.py", line 75, in wrapper 2022-08-24T16:20:49.2415036Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/core.py", line 39, in display_notices 2022-08-24T16:20:49.2415853Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/http.py", line 36, in get_notice_responses 2022-08-24T16:20:49.2416661Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/http.py", line 39, in 2022-08-24T16:20:49.2417399Z File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator 2022-08-24T16:20:49.2418145Z File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 446, in result 2022-08-24T16:20:49.2418831Z File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result 2022-08-24T16:20:49.2419543Z File "/opt/conda/lib/python3.9/concurrent/futures/thread.py", line 58, in run 2022-08-24T16:20:49.2420292Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/http.py", line 42, in 2022-08-24T16:20:49.2421070Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/cache.py", line 37, in wrapper 2022-08-24T16:20:49.2421712Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/http.py", line 58, in get_channel_notice_response 2022-08-24T16:20:49.2422258Z File "/opt/conda/lib/python3.9/site-packages/requests/sessions.py", line 600, in get 2022-08-24T16:20:49.2422801Z File "/opt/conda/lib/python3.9/site-packages/requests/sessions.py", line 587, in request 2022-08-24T16:20:49.2423226Z File "/opt/conda/lib/python3.9/site-packages/requests/sessions.py", line 701, in send 2022-08-24T16:20:49.2423634Z File "/opt/conda/lib/python3.9/site-packages/requests/adapters.py", line 460, in send 2022-08-24T16:20:49.2424239Z File "/opt/conda/lib/python3.9/site-packages/requests/adapters.py", line 263, in cert_verify 2022-08-24T16:20:49.2424731Z OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /opt/conda/lib/python3.9/site-packages/certifi/cacert.pem 2022-08-24T16:20:49.2424967Z 2022-08-24T16:20:49.2425110Z During handling of the above exception, another exception occurred: 2022-08-24T16:20:49.2425279Z 2022-08-24T16:20:49.2425377Z Traceback (most recent call last): 2022-08-24T16:20:49.2425610Z File "/opt/conda/bin/conda", line 13, in 2022-08-24T16:20:49.2425845Z sys.exit(main()) 2022-08-24T16:20:49.2426176Z File "/opt/conda/lib/python3.9/site-packages/conda/cli/main.py", line 129, in main 2022-08-24T16:20:49.2426614Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1413, in conda_exception_handler 2022-08-24T16:20:49.2427054Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1128, in __call__ 
2022-08-24T16:20:49.2427555Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1170, in handle_exception
2022-08-24T16:20:49.2427995Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1181, in handle_unexpected_exception
2022-08-24T16:20:49.2428471Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1251, in print_unexpected_error_report
2022-08-24T16:20:49.2428873Z ModuleNotFoundError: No module named 'conda.cli.main_info'
2022-08-24T16:20:55.5428691Z The command '/bin/sh -c bash ./install_conda.sh && rm install_conda.sh' returned a non-zero code: 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83991
Approved by: https://github.com/malfet

* Deletes CCACHE_DISABLE and SCCACHE_DISABLE from nccl.cmake (#84007)

Looking through the code and online, it does not look like these variables actually change anything. Regardless, this change was instituted to fix https://github.com/pytorch/pytorch/issues/13362, but we are again running into similar issues even with the workaround: see https://github.com/pytorch/pytorch/issues/83790.

Thus, since
1. this change isn't preventing flakiness, and
2. these variables do not seem to be used anywhere in pytorch/pytorch nor mozilla/sccache,
we should remove this confusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84007
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi

* Named pipe based watchdog timer (#83695)

Summary: This diff implements a named pipe based watchdog timer (`FileTimerClient` and `FileTimerServer`). This is similar to the existing `LocalTimerClient` and `LocalTimerServer` (https://fburl.com/code/j4b9pyya).

The motivation comes from the need to handle various timeout issues. The training process occasionally gets stuck, so we need a proper watchdog to monitor the liveness of the training processes. This timer allows the TorchElastic agent (as the watchdog) to monitor the progress of the training processes that it spawned. If a timeout occurs, the TorchElastic agent can take action to kill the stuck process and create a core dump for it.

`LocalTimerClient` and `LocalTimerServer` require a `multiprocessing.Queue()` to work, so they can only be used between `multiprocessing` parent and child processes. `FileTimerClient` and `FileTimerServer` do not have such a limitation.
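For context, here is a minimal sketch of the existing queue-based timer pattern that the file-based variant mirrors; the `FileTimer*` classes are expected to follow the same configure/expires flow with a named-pipe path in place of the queue, though exact signatures may differ:

```python
import multiprocessing as mp

import torch.distributed.elastic.timer as timer


def worker(queue):
    # Each worker registers a client and guards suspect sections with a deadline.
    timer.configure(timer.LocalTimerClient(queue))
    with timer.expires(after=60):
        pass  # work that must finish within 60 seconds


if __name__ == "__main__":
    q = mp.Queue()
    server = timer.LocalTimerServer(q, max_interval=0.25)
    server.start()  # the watchdog reaps workers whose timers expire

    p = mp.Process(target=worker, args=(q,))
    p.start()
    p.join()
    server.stop()
```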
Test Plan: ### Unit Test ``` buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test ``` ``` RemoteExecution session id: reSessionID-06d70a77-043c-4d9d-b0f2-94c24460740a-tpx Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666 ✓ ListingSuccess: caffe2/test/distributed/elastic/timer:file_based_timer_test : 12 tests discovered (2.177) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_happy_path (file_based_local_timer_test.FileTimerTest) (2.463) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_expired_timers (file_based_local_timer_test.FileTimerServerTest) (1.889) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_send_request_release (file_based_local_timer_test.FileTimerServerTest) (1.700) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_valid_timers (file_based_local_timer_test.FileTimerServerTest) (1.873) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_call_count (file_based_local_timer_test.FileTimerServerTest) (1.715) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_empty_queue (file_based_local_timer_test.FileTimerServerTest) (1.609) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_exception_propagation (file_based_local_timer_test.FileTimerTest) (1.633) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_multiple_clients_interaction (file_based_local_timer_test.FileTimerTest) (2.189) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_get_timer_recursive (file_based_local_timer_test.FileTimerTest) (2.295) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_no_client (file_based_local_timer_test.FileTimerTest) (1.753) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_timer (file_based_local_timer_test.FileTimerTest) (2.151) ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_client_interaction (file_based_local_timer_test.FileTimerTest) (1.895) Summary Pass: 12 ListingSuccess: 1 Finished test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666 ``` Differential Revision: D38604238 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83695 Approved by: https://github.com/d4l3k * Enhance add_out_dense_sparse_cpu for hybrid sparse tensor (#23057) This is to improve the performance for hybrid sparse coo tensor on CPU path. This case is appeared at the DLRM terabyte test. With this fix, according to the previous performance test data, it got ~10x performance improvement on DLRM execution. 
without this, the DLRM will run as Finished training it 100/1000 of epoch 0, 2969.25 ms/it, loss 0.220505, accuracy 0.000 % with this, the DLRM will run as Finished training it 100/1000 of epoch 0, 270.71 ms/it, loss 0.220505, accuracy 0.000 % Pull Request resolved: https://github.com/pytorch/pytorch/pull/23057 Approved by: https://github.com/VitalyFedyunin, https://github.com/malfet * Pretty print stack trace with gm.print_readable() (#83706) Precondition: https://github.com/pytorch/torchdynamo/pull/899 Given following function ``` def my_relu(a): return a.relu() def func(a, b): d = torch.square(a + b) e = my_relu(d) f = d.sin() s = torch.stack([e, f]) s = s.sum() ``` Here are the possible result with various tracing frontend: dynamo, symbolic_trace, make_fx - joint graph with torchdynamo.optimize("aot_nop") Notice that it has a special stack for gradient addition node (for multiple uses of tensor) in backward Notice that "No stacktrace found for following nodes" are shown for nodes with stacktrace ``` def forward(self, primals, tangents): primals_1, primals_2, tangents_1, = fx_pytree.tree_flatten_spec([primals, tangents], self._in_spec) # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 41, in func, d = torch.square(a + b) add_tensor = torch.ops.aten.add.Tensor(primals_1, primals_2); primals_1 = primals_2 = None pow_tensor_scalar = torch.ops.aten.pow.Tensor_Scalar(add_tensor, 2) # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 38, in my_relu, return a.relu() relu_default = torch.ops.aten.relu.default(pow_tensor_scalar) detach_default = torch.ops.aten.detach.default(relu_default) # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 43, in func, f = d.sin() sin_default = torch.ops.aten.sin.default(pow_tensor_scalar) # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 44, in func, s = torch.stack([e, f]) stack_default = torch.ops.aten.stack.default([relu_default, sin_default]); relu_default = sin_default = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 45, in func, s = s.sum() sum_default = torch.ops.aten.sum.default(stack_default); stack_default = None # No stacktrace found for following nodes is_same_size_default = torch.ops.aten.is_same_size.default(sum_default, tangents_1) # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 45, in func, s = s.sum() expand_default = torch.ops.aten.expand.default(tangents_1, [2, 10, 10]); tangents_1 = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 44, in func, s = torch.stack([e, f]) unbind_int = torch.ops.aten.unbind.int(expand_default); expand_default = None getitem = unbind_int[0] getitem_1 = unbind_int[1]; unbind_int = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 43, in func, f = d.sin() cos_default = torch.ops.aten.cos.default(pow_tensor_scalar); pow_tensor_scalar = None mul_tensor = torch.ops.aten.mul.Tensor(getitem_1, cos_default); getitem_1 = cos_default = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 38, in my_relu, return a.relu() detach_default_1 = torch.ops.aten.detach.default(detach_default); detach_default = None threshold_backward_default = torch.ops.aten.threshold_backward.default(getitem, detach_default_1, 0); getitem = detach_default_1 = None # Gradient addition node due to mulitple use of tensor around:, File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 38, in my_relu, return a.relu() add_tensor_1 = torch.ops.aten.add.Tensor(mul_tensor, threshold_backward_default); mul_tensor = threshold_backward_default = None # 
File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 41, in func, d = torch.square(a + b) pow_tensor_scalar_1 = torch.ops.aten.pow.Tensor_Scalar(add_tensor, 1.0); add_tensor = None mul_scalar = torch.ops.aten.mul.Scalar(pow_tensor_scalar_1, 2.0); pow_tensor_scalar_1 = None mul_tensor_1 = torch.ops.aten.mul.Tensor(add_tensor_1, mul_scalar); add_tensor_1 = mul_scalar = None sum_sym_int = torch.ops.aten.sum.SymInt(mul_tensor_1, [0], True) view_sym_int = torch.ops.aten.view.SymInt(sum_sym_int, [10]); sum_sym_int = None return pytree.tree_unflatten([sum_default, mul_tensor_1, view_sym_int], self._out_spec) ``` - default symbolic_trace Notice that nodes without stacktrace are folded under same region ``` def forward(self, a, b): # No stacktrace found for following nodes add = a + b; a = b = None square = torch.square(add); add = None relu = square.relu() sin = square.sin(); square = None stack = torch.stack([relu, sin]); relu = sin = None sum_1 = stack.sum(); stack = None return sum_1 ``` - symbolic_trace with record_stack_traces=True ``` def forward(self, a, b): # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 41, in func, d = torch.square(a + b) add = a + b; a = b = None square = torch.square(add); add = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 38, in my_relu, return a.relu() relu = square.relu() # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 43, in func, f = d.sin() sin = square.sin(); square = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 44, in func, s = torch.stack([e, f]) stack = torch.stack([relu, sin]); relu = sin = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 45, in func, s = s.sum() sum_1 = stack.sum(); stack = None return sum_1 ``` - make_fx without decomposition ``` def forward(self, a_1, b_1): # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 41, in func, d = torch.square(a + b) add_tensor = torch.ops.aten.add.Tensor(a_1, b_1); a_1 = b_1 = None pow_tensor_scalar = torch.ops.aten.pow.Tensor_Scalar(add_tensor, 2); add_tensor = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 38, in my_relu, return a.relu() relu_default = torch.ops.aten.relu.default(pow_tensor_scalar) detach_default = torch.ops.aten.detach.default(relu_default) # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 43, in func, f = d.sin() sin_default = torch.ops.aten.sin.default(pow_tensor_scalar); pow_tensor_scalar = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 44, in func, s = torch.stack([e, f]) stack_default = torch.ops.aten.stack.default([relu_default, sin_default]); relu_default = sin_default = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 45, in func, s = s.sum() sum_default = torch.ops.aten.sum.default(stack_default); stack_default = None return sum_default ``` - make_fx with decomposition to prims ``` def forward(self, a_1, b_1): # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 41, in func, d = torch.square(a + b) broadcast_in_dim_default = torch.ops.prims.broadcast_in_dim.default(b_1, [10, 10], [1]); b_1 = None add_default = torch.ops.prims.add.default(a_1, broadcast_in_dim_default); a_1 = broadcast_in_dim_default = None mul_default = torch.ops.prims.mul.default(add_default, add_default); add_default = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 38, in my_relu, return a.relu() le_default = torch.ops.prims.le.default(mul_default, 0.0) where_default = torch.ops.prims.where.default(le_default, 0.0, mul_default); 
le_default = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 43, in func, f = d.sin() sin_default = torch.ops.prims.sin.default(mul_default); mul_default = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 44, in func, s = torch.stack([e, f]) cat_default = torch.ops.prims.cat.default([where_default, sin_default], 0); where_default = sin_default = None split_dim_default = torch.ops.prims.split_dim.default(cat_default, 0, 2); cat_default = None # File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 45, in func, s = s.sum() convert_element_type_default = torch.ops.prims.convert_element_type.default(split_dim_default, torch.float32); split_dim_default = None sum_default = torch.ops.prims.sum.default(convert_element_type_default, [0, 1, 2]); convert_element_type_default = None return sum_default ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83706 Approved by: https://github.com/Chillee, https://github.com/ezyang * Add comments for block_reduce.cuh (#83825) ~~Add warning for the BlockReduce result Remove redundant __syncthreads~~ Add comments for BlockReduce Pull Request resolved: https://github.com/pytorch/pytorch/pull/83825 Approved by: https://github.com/ngimel * Add docstring type guidelines for list & tuple to `CONTRIBUTING.md` (#83634) Minor followup to: https://github.com/pytorch/pytorch/pull/83536 For Google style docstrings, `list` and `tuple` should be completely lowercase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83634 Approved by: https://github.com/ngimel * use condensed disabled tests file (#84017) follow up to https://github.com/pytorch/test-infra/pull/545 then we can get rid of the non condensed version Pull Request resolved: https://github.com/pytorch/pytorch/pull/84017 Approved by: https://github.com/huydhn, https://github.com/janeyx99 * Revert "Make linalg.inv composite of linalg.solve (#80074)" This reverts commit 4737b3361479f4104efaa3bfa2ea517eaacb60fb. Reverted https://github.com/pytorch/pytorch/pull/80074 on behalf of https://github.com/malfet due to Depends on the changes from https://github.com/pytorch/pytorch/pull/83628 * Revert "[xla hash update] update the pinned xla hash (#83967)" This reverts commit ce7a9f92e30b93ab6efff4135be005c9afd0533a. Reverted https://github.com/pytorch/pytorch/pull/83967 on behalf of https://github.com/malfet due to Depends on the changes from https://github.com/pytorch/pytorch/pull/83628 * Revert "Don't introduce new overload for SymInt (#83628)" This reverts commit 8fae7027b399e65e6071d335aa874497682c84d0. Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to breaking internal builds, see https://www.internalfb.com/diff/D38984222 * [Quant] Vectorize scalar remainder in quantized kernel for normalization (#79673) ## Description This PR improves performance of quantized kernel for normalize by vectorizing scalar remainder. In the current implementation [here](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp), the computation is vectorized while the scalar remainder is handled in a `for` loop. The remainder is also vectorized to improve performance in this PR. This kernel is for contiguous (NCHW) memory layout. For channels-last memory layout, a fast path is added in this PR https://github.com/pytorch/pytorch/pull/70520 The improvement is beneficial for layer norm, group norm and instance norm as this kernel is used for them. ## Changes 1. 
Add an argument `size` to `Vectorized::loadu()` for vec256_qint and vec512_qint. 2. Load the remainder with the new `loadu` and do computation in the similar way as for vectorized part. ## Validation ### Test method: Run quantized group norm with group = 2. Op CPU time measured by `torch.profiler.profile` with warmup = 20, active = 200 ### Common environment: - Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz - OS: CentOS Linux 7 (Core) (x86_64) - Python version: 3.7.10 - Use JeMalloc memory allocator - MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto - Using Intel OpenMP - KMP_AFFINITY=granularity=fine,compact,1,0 - KMP_BLOCKTIME=1 ### Case 1: AVX2 **Environment** - GCC version: (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3) - AVX2 enabled, AVX512 disabled, i.e., vec256 used **Run a single instance on a single core** Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments -- | -- | -- | -- | -- | -- | -- (1, 2, 8, 5) | 3.73 | 3.75 | 4.51 | 99.41% | 82.75% | Remainder size = 8 (1, 2, 8, 6) | 3.76 | 4.00 | 4.53 | 93.93% | 82.95% | Remainder size = 16 (1, 2, 8, 7) | 3.74 | 4.01 | 4.52 | 93.34% | 82.84% | Remainder size = 24 (1, 2, 8, 8) | 3.90 | 3.96 | 4.49 | 98.49% | 87.00% | No remainder (1, 2, 8, 17) | 4.00 | 4.17 | 4.72 | 95.83% | 84.69% | Remainder size = 8 (1, 2, 8, 18) | 4.00 | 4.23 | 4.72 | 94.54% | 84.89% | Remainder size = 16 (1, 2, 8, 19) | 4.03 | 4.29 | 4.76 | 94.01% | 84.70% | Remainder size = 24 (1, 2, 8, 20) | 3.92 | 3.93 | 4.76 | 99.67% | 82.29% | No remainder (1, 2, 8, 33) | 4.10 | 4.18 | 5.06 | 97.92% | 81.00% | Remainder size = 8 (1, 2, 8, 34) | 4.07 | 4.23 | 5.06 | 96.40% | 80.53% | Remainder size = 16 (1, 2, 8, 35) | 4.11 | 4.42 | 5.09 | 93.03% | 80.72% | Remainder size = 24 (1, 2, 8, 36) | 4.03 | 4.06 | 5.11 | 99.24% | 78.83% | No remainder ![image](https://user-images.githubusercontent.com/12522207/173979129-e393e13f-71f5-4987-95ea-ac6e0c895bd7.png) **Run a single instance on two cores** Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments -- | -- | -- | -- | -- | -- | -- (1, 4, 8, 5) | 5.09 | 5.24 | 5.52 | 97.17% | 92.29% | Remainder size = 8 (1, 4, 8, 6) | 5.22 | 5.50 | 5.56 | 94.95% | 93.86% | Remainder size = 16 (1, 4, 8, 7) | 5.04 | 5.60 | 5.51 | 89.97% | 91.44% | Remainder size = 24 (1, 4, 8, 8) | 5.30 | 5.29 | 5.56 | 100.23% | 95.27% | No remainder (1, 4, 8, 17) | 5.36 | 5.56 | 6.05 | 96.53% | 88.69% | Remainder size = 8 (1, 4, 8, 18) | 5.48 | 5.71 | 6.25 | 95.99% | 87.67% | Remainder size = 16 (1, 4, 8, 19) | 5.44 | 5.81 | 6.25 | 93.65% | 87.11% | Remainder size = 24 (1, 4, 8, 20) | 5.43 | 5.34 | 6.07 | 101.76% | 89.43% | No remainder (1, 4, 8, 33) | 5.52 | 5.58 | 6.51 | 98.89% | 84.75% | Remainder size = 8 (1, 4, 8, 34) | 5.50 | 5.71 | 6.63 | 96.22% | 82.95% | Remainder size = 16 (1, 4, 8, 35) | 5.50 | 6.16 | 6.40 | 89.33% | 85.95% | Remainder size = 24 (1, 4, 8, 36) | 5.37 | 5.48 | 6.54 | 97.94% | 81.98% | No remainder ![image](https://user-images.githubusercontent.com/12522207/173981377-6222e278-0948-4f52-809b-28899399ca65.png) ### Case 2: AVX512 **Environment** - GCC version: (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2) - AVX512 enabled, i.e., vec512 used **Run a single instance on a single core** Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments -- | -- | -- | -- | -- | -- | -- (1, 2, 16, 5) | 3.66 | 3.94 | 4.52 | 92.79% | 80.93% | Remainder size = 16 (1, 2, 16, 6) | 3.77 | 4.28 | 4.60 | 88.15% | 81.90% | Remainder size = 32 (1, 2, 16, 7) | 3.85 | 4.41 | 4.57 | 
87.36% | 84.20% | Remainder size = 48 (1, 2, 16, 8) | 3.70 | 3.76 | 4.62 | 98.62% | 80.10% | No remainder (1, 2, 16, 17) | 3.91 | 4.06 | 4.97 | 96.43% | 78.71% | Remainder size = 16 (1, 2, 16, 18) | 3.82 | 4.34 | 5.01 | 88.19% | 76.30% | Remainder size = 32 (1, 2, 16, 19) | 3.86 | 4.56 | 5.05 | 84.63% | 76.28% | Remainder size = 48 (1, 2, 16, 20) | 3.80 | 3.87 | 5.08 | 98.14% | 74.73% | No remainder (1, 2, 16, 33) | 3.89 | 4.23 | 5.65 | 91.94% | 68.85% | Remainder size = 16 (1, 2, 16, 34) | 3.91 | 4.46 | 5.70 | 87.68% | 68.61% | Remainder size = 32 (1, 2, 16, 35) | 4.04 | 4.68 | 5.72 | 86.44% | 70.64% | Remainder size = 48 (1, 2, 16, 36) | 4.00 | 3.99 | 5.71 | 100.28% | 69.96% | No remainder ![image](https://user-images.githubusercontent.com/12522207/173982490-4687c5bc-50e8-49aa-9fe2-7967c738dbfb.png) **Run a single instance on two cores** Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments -- | -- | -- | -- | -- | -- | -- (1, 4, 16, 5) | 5.43 | 5.53 | 5.92 | 98.12% | 91.60% | Remainder size = 16 (1, 4, 16, 6) | 5.35 | 5.85 | 6.05 | 91.53% | 88.54% | Remainder size = 32 (1, 4, 16, 7) | 5.31 | 6.04 | 6.18 | 87.97% | 85.93% | Remainder size = 48 (1, 4, 16, 8) | 5.30 | 5.27 | 6.30 | 100.66% | 84.16% | No remainder (1, 4, 16, 17) | 5.47 | 5.67 | 6.48 | 96.51% | 84.45% | Remainder size = 16 (1, 4, 16, 18) | 5.53 | 5.86 | 6.59 | 94.28% | 83.78% | Remainder size = 32 (1, 4, 16, 19) | 5.48 | 6.13 | 6.57 | 89.39% | 83.38% | Remainder size = 48 (1, 4, 16, 20) | 5.35 | 5.31 | 6.95 | 100.79% | 76.91% | No remainder (1, 4, 16, 33) | 5.62 | 5.77 | 7.31 | 97.28% | 76.80% | Remainder size = 16 (1, 4, 16, 34) | 5.56 | 5.85 | 7.06 | 95.03% | 78.71% | Remainder size = 32 (1, 4, 16, 35) | 5.67 | 6.10 | 7.09 | 93.03% | 79.98% | Remainder size = 48 (1, 4, 16, 36) | 5.50 | 5.39 | 7.20 | 102.15% | 76.42% | No remainder ![image](https://user-images.githubusercontent.com/12522207/173982748-5f003630-18a4-4c3d-a643-b8711892cc39.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/79673 Approved by: https://github.com/jerryzh168 * Increase timeout for linux binary builds (#84008) Increase timeout for linux binary builds This mitigates conda build issue: https://github.com/pytorch/pytorch/issues/84003 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84008 Approved by: https://github.com/malfet * [NVFuser] Upstream push 0811 (#83239) Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/ Code changes includes: - codegen improvements: 1. double support in expression evaluator - bug fixes: 1. dropout fix - rework RNG to support broadcasted dropout (Fixes #82784) 2. expand fix - Patch expand+reduction, expand+view, rework view analysis and guard - scheduler: 1. manual transpose schedule example 2. 
WIP transpose scheduler Commits that's in this PR from the devel branch: ``` b7435afcd22c917713c2f41a7237bc26e1183f14 Transpose scheduler, step 1 (#1854) 8a45dbf72034684eb8e18b1835b533e90b68f184 Add an example on how to manually schedule transpose (#1889) 83dbf56a9554b2efbd5416461d938fff477b0b27 Patch dropout fix (#1898) 69d3519a532250719b1aa8341b50e067b181b42d Expand+Reduction, Expand+View support, rework View analysis and guards (#1883) 15091c488e96343bdc49e3990acbf238a3b3da51 Rework RNG to correctly support broadcasted dropout (#1888) aafe2d048aaac596e503596a41303423619f3954 Make ExpressionEvaluator support Double (#1885) ``` RUN_TORCHBENCH: nvfuser Differential Revision: [D38657074](https://our.internmc.facebook.com/intern/diff/D38657074) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83239 Approved by: https://github.com/davidberard98 * [TorchTidy] Adding support for unique tensor identifiers (#80266) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80266 Approved by: https://github.com/robieta * fix oneDNN channels_last path issue (#83653) Fix #82060(N>1 will call in OneDNN path) and #80837, those two issues are introduced by the definition of channels last is different between PyTorch FW side with ideep side, this PR will fix this gap which ideep will use the format flag given by FW side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83653 Approved by: https://github.com/mingfeima, https://github.com/malfet * [caffe2] Remove last clang-for-cuda sources (#84021) Summary: We're no longer pursuing clang-for-cuda, so remove the last use-case. Test Plan: CI Reviewed By: pallab-zz Differential Revision: D38996710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84021 Approved by: https://github.com/malfet * Revert "Support NCCL Premul Sum (#81272)" This reverts commit 432c508e71111f9d5382322e0e6b1bc1c66bf0ec. Reverted https://github.com/pytorch/pytorch/pull/81272 on behalf of https://github.com/weiwangmeta due to breaking internal builds * Revert "[TorchTidy] Adding support for unique tensor identifiers (#80266)" This reverts commit b6ba41921daf6365a762562641bfd846437c8529. Reverted https://github.com/pytorch/pytorch/pull/80266 on behalf of https://github.com/malfet due to Broke number of trunk jobs, see https://hud.pytorch.org/pytorch/pytorch/commit/b6ba41921daf6365a762562641bfd846437c8529 * NCCL: Re-enable parallel builds (#83696) Since #83173 was merged I have noticed some CI being slowed down by the nccl building step. e.g. if there are no C++ changes then sccache compiles everything else very quickly and nccl becomes the limiting factor. This re-enables parallel builds with some safeguards to protect against oversubscription. When `make` is the parent build system, we can use `$(MAKE)` and the `make` jobserver will coordinate job allocation with the sub-process. For other build systems, this calls `make` with the `-l` flag which should prevent it launching jobs when the system load average is already too high. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83696 Approved by: https://github.com/malfet * [fx+scripting] Adding num_iter_1 and num_iter_2 params LearningRate op (#83691) Summary: Adding num_iter_1 and num_iter_2 to learning rate op Test Plan: Exisiting unit tests Differential Revision: D38762710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83691 Approved by: https://github.com/qxy11 * Fix dumb make_fx issue (#84011) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84011 Approved by: https://github.com/ezyang * [fx] add deferred weights (xl_weight) and tracing for xl_embedding_bag (#84016) Test Plan: added unit tests Reviewed By: jfix71 Differential Revision: D36152238 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84016 Approved by: https://github.com/jfix71 * Enable cache action for lint workflow (#84026) Cache all python dependencies using [GHA cache](https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows). I'm doing this for lint workflow first and will slowly roll it out to other workflows. ### Testing Before caching, pip cache is not found. Dependencies installation continues as usual: ![Screen Shot 2022-08-24 at 16 36 15](https://user-images.githubusercontent.com/475357/186543554-9d7f5978-2c2d-4362-9535-c3b17e922da1.png) After caching https://github.com/pytorch/pytorch/runs/8006214772?check_suite_focus=true. The long hash at the end of the cache key is the hash of requirements files ![Screen Shot 2022-08-24 at 16 51 51](https://user-images.githubusercontent.com/475357/186543825-055ea025-3d42-42fc-877d-baec358de0ed.png) Note that the cache is in the runners themselves. This should be a transparent process. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84026 Approved by: https://github.com/seemethere, https://github.com/suo, https://github.com/malfet * switching the exact check to isinstance check (#84023) Simplifying a type check if an object is a SymIntNode in `is_symint_node` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84023 Approved by: https://github.com/ezyang * Disable autocast cache during aotdispatch (#84035) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84035 Approved by: https://github.com/jansel * Make linalg.inv composite of linalg.solve (#80074) The `getri` kernel calls inside `getrs` so we can do so explicitly ourselves and save ourselves from having to maintain an extra kernel. This way we just need to optimise `lu_factor` and `lu_solve` and `inv` will be as efficient as it can be, as it'll be choosing the best backend to perform the factorisation and the best backend (not necessarily the same) to perform the solve. 
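A numerical sketch of the equivalence being exploited here (illustrative only, not the PR's actual code path): the inverse is just a solve against the identity, routed through an explicit LU factorization.

```python
import torch

A = torch.randn(4, 4, dtype=torch.float64)

# inv(A) expressed as a solve against the identity via lu_factor / lu_solve.
LU, pivots = torch.linalg.lu_factor(A)
inv_via_solve = torch.linalg.lu_solve(LU, pivots, torch.eye(4, dtype=torch.float64))

torch.testing.assert_close(inv_via_solve, torch.linalg.inv(A))
```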
Fixes https://github.com/pytorch/pytorch/issues/77498 The benchmarks: https://github.com/pytorch/pytorch/pull/80074#issuecomment-1164309071 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80074 Approved by: https://github.com/IvanYashchuk, https://github.com/albanD, https://github.com/malfet * add qscheme check for quantization observer (#80126) Motivation: each quantization observer only supports a limit qschemes, we need to do this check at the initiation step, rather than at the running step, such as MinMaxObserver with set qscheme with **torch.per_channel_affine**, there will have a runtime error at the running the calibration step: ``` AttributeError: 'MinMaxObserver' object has no attribute 'ch_axis' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/80126 Approved by: https://github.com/jerryzh168 * [functorch] add batching rule for fill_.Tensor (#84015) I think this is what the theseus folks ran into, but will confirm with them later. Test Plan: - new manual test; the OpInfo for fill_ isn't sufficient and it is difficult to modify Pull Request resolved: https://github.com/pytorch/pytorch/pull/84015 Approved by: https://github.com/Chillee * fix `NoneType` object has no attribute `python_exit_status` (#83985) Fixes #83791 Prevents the Error when `_utils` has been cleared by Python before `__del__` is invoked. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83985 Approved by: https://github.com/NivekT * Decomposition - batch_norm, save_mean and save_variance always float32 (#84013) AMP error shown here - https://github.com/pytorch/torchdynamo/issues/835 Test missing Pull Request resolved: https://github.com/pytorch/pytorch/pull/84013 Approved by: https://github.com/ezyang * enable qlinear dynamic parallelization with fbgemm (#84033) Test Plan: CI Differential Revision: D39004891 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84033 Approved by: https://github.com/jerryzh168 * [quant][ao_migration] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` (#78713) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] [Current PR] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. 
However, specific files need to be double checked: - Documentation @vkuzo - docs/source/conf.py - docs/source/quantization.rst - [quantize_fx](torch/ao/quantization/quantize_fx.py) @jerryzh168 - [common test routine](test/quantization/ao_migration/common.py) @HDCharles - JIT stuff @jamesr66a - torch/csrc/jit/passes/hoist_conv_packed_params.cpp - torch/csrc/jit/passes/quantization/helper.h - torch/csrc/jit/serialization/import_source.cpp Differential Revision: [D38926012](https://our.internmc.facebook.com/intern/diff/D38926012/) Differential Revision: [D38926012](https://our.internmc.facebook.com/intern/diff/D38926012) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78713 Approved by: https://github.com/jerryzh168 * [quant][ao_migration] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` (#78714) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] [Current PR] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - [Documentation](docs/source/quantization-support.rst) @vkuzo - [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10 - [BC test](test/quantization/bc/test_backward_compatibility.py) @vkuzo - [IR emitter](torch/csrc/jit/frontend/ir_emitter.cpp) @jamesr66a - [JIT serialization](torch/csrc/jit/serialization/import_source.cpp) @IvanKobzarev @jamesr66a Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860660/)! Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78714 Approved by: https://github.com/jerryzh168 * [quant][ao_migration] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` (#78715) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. 
The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [X] [Current PR] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - None Differential Revision: [D36860927](https://our.internmc.facebook.com/intern/diff/D36860927/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860927/)! Differential Revision: [D36860927](https://our.internmc.facebook.com/intern/diff/D36860927) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78715 Approved by: https://github.com/jerryzh168 * [quant][ao_migration] `torch.nn.quantizable` → `torch.ao.nn.quantizable`. (#78717) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [X] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [X] [Current PR] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - `torch/ao/nn/__init__.py` → Changing the imports to lazy. Differential Revision: [D36861090](https://our.internmc.facebook.com/intern/diff/D36861090/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861090/)! 
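To make the migration series above concrete, an import-level illustration (assuming a build that already ships the `torch.ao.nn` namespace; the old paths are kept as backward-compatible aliases during the transition):

```python
# Illustrative only: new code imports from torch.ao.nn, while the legacy
# torch.nn.quantized path keeps working during the transition.
import torch.ao.nn.quantized as new_ns
import torch.nn.quantized as old_ns

print(new_ns.Linear)
print(old_ns.Linear)  # expected to resolve to the same class through the BC alias
```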
Differential Revision: [D36861090](https://our.internmc.facebook.com/intern/diff/D36861090) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78717 Approved by: https://github.com/jerryzh168 * [quant][ao_migration] `torch.nn.qat` → `torch.ao.nn.qat` (#78716) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [X] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat` - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - None Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)! Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716 Approved by: https://github.com/jerryzh168 * disable c10::SymIntNode tests on mobile (#84066) This fixes c++ tests' breaks where we were passing pointers and expected `is_symbolic` to return `true` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84066 Approved by: https://github.com/albanD * [GHF][BE] Move merge rules to yaml (#84065) To allow comments Update `trymerge.yaml`, `revert.yaml` and `tryrebase.yaml` to use v4 setup-python action and install pyyaml Reformat json to yaml by running: ``` python -c "import yaml;print(yaml.dump(yaml.safe_load(open('.github/merge_rules.yaml')), sort_keys=False))" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84065 Approved by: https://github.com/b0noI, https://github.com/huydhn * run functorch decomps after functionalization when enabled (#83992) This is a short-to-midterm fix for https://github.com/pytorch/pytorch/issues/83923. By running functionalization before decomps, we guarantee that functionalization won't have to see any primtorch view/inplace ops like `broadcast_in_dim`. This will only really be a problem if there's a function in the decomposition table that decomposes a functional op into mutations. If that comes up later, we'll need to revisit https://github.com/pytorch/pytorch/issues/83923. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83992 Approved by: https://github.com/ezyang * functionalization: support inplace views on inputs (#83993) A version of this PR was sitting at https://github.com/pytorch/pytorch/pull/82601 but that PR some other cleanup that relies on being able to use functorch in pytorch/pytorch CI tests, which isn't ready yet. I pulled the change out here to unblock functionalization for some models run with inductor (see https://github.com/pytorch/torchdynamo/issues/964#issuecomment-1225971788). Pull Request resolved: https://github.com/pytorch/pytorch/pull/83993 Approved by: https://github.com/ezyang * [DataPipe] Reset Shuffler's iterator when NotStarted (#83535) This PR changes the behavior of `IterDataPipe` to always invoke `reset` for the state of `NotStarted`. The main reason is we normally put lazy initialization code into `reset` function. Even for the state of `NotStarted`, we should invoke `reset` to initialize those lazy variables. Otherwise, we have to manually determine if the state is `NotStarted` or `Iterating` in `__iter__` function and only manually invoke `reset` in the state of `NotStarted`. This PR also makes `Shuffler` is able to serialize with `buffer` and `rng_state`. The following part is removed: ~I am also add `_snapshot_state` into serialization state and during `__setstate__` only change the state to `Restored` if the original state is `Iterating`. Especially, for the case of deserializing/serializing `NotStarted` DataPipe (multiprocessing), we would invoke `set_seed` for `Shuffler`. We need the `DataPipe` remains as `NotStarted` to properly `reset`.~ I am listing all the expected behavior state transition below: - Initial state: `NotStarted` - `iter` -> Call `reset` and change the state to `Iterating` - serialize/deserialize -> Keep the state as `NotStarted` (will `reset` if `iter` is called afterwards) - Initial state: `Iterating` - `iter` -> Call `reset` and keep the state to `Iterating` - serialize/deserialize -> Change the state as `Restored` - Initial state: `Restored` - `iter` -> Only change the state to `Iterating` - serialize/deserialize -> Not allowed Pull Request resolved: https://github.com/pytorch/pytorch/pull/83535 Approved by: https://github.com/NivekT * [ONNX] Assign ONNXScopeName during function substituion (#82039) Previously only traced IR graph stores module typename and variable name in `scope` in `node`. This change enables such `scope` info for IR graph generated by torch script. Torch script produced IR graphs emit nodes for module object and module method call. This structured graph is flattened in `function_substition` pass prior to other ONNX conversion passes. This PR extends `function_substition` pass to record the module typename and variable name info in `scope`, while inlining the graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82039 Approved by: https://github.com/justinchuby, https://github.com/abock * Torch cond operator, python dispatch, pyoperator (#83154) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/83154 Approved by: https://github.com/ezyang * [vulkan] use VMA at third-party (#83934) Remove the VMA checked in at `aten/src/ATen/native/vulkan/api/vk_mem_alloc.h`, and use the version checked into `fbsource/third_party` instead. Also change open source CMakeLists to look for VMA in third_party submodule directory. 
Note that I had to add an alternate VulkanMemoryAllocator target that uses `fb_xplat_cxx_library` instead of `oxx_static_library` to make it work with vulkan targets in `caffe2`. Before landing this diff, make sure https://github.com/pytorch/pytorch/pull/83906 is committed on open source, which adds VMA as a git submodule of pytorch. Differential Revision: [D38943217](https://our.internmc.facebook.com/intern/diff/D38943217/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38943217/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/83934 Approved by: https://github.com/manuelcandales * [GHF] Land validation should not change default branch (#84084) This prevents a loophole, where somebody submits a PR that modifies merge rules and request land validation, so that their PR will be validated against those rules, rather than ones currently in trunk. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84084 Approved by: https://github.com/janeyx99, https://github.com/kit1980 * [ONNX] Add runtime type checking to `export` (#83673) This PR adds an internal wrapper on the [beartype](https://github.com/beartype/beartype) library to perform runtime type checking in `torch.onnx`. It uses beartype when it is found in the environment and is reduced to a no-op when beartype is not found. Setting the env var `TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK=ERRORS` will turn on the feature. setting `TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK=DISABLED` will disable all checks. When not set and `beartype` is installed, a warning message is emitted. Now when users call an api with invalid arguments e.g. ```python torch.onnx.export(conv, y, path, export_params=True, training=False) # traning should take TrainingModel, not bool ``` they get ``` Traceback (most recent call last): File "bisect_m1_error.py", line 63, in main() File "bisect_m1_error.py", line 59, in main reveal_error() File "bisect_m1_error.py", line 32, in reveal_error torch.onnx.export(conv, y, cpu_model_path, export_params=True, training=False) File "<@beartype(torch.onnx.utils.export) at 0x1281f5a60>", line 136, in export File "pytorch/venv/lib/python3.9/site-packages/beartype/_decor/_error/errormain.py", line 301, in raise_pep_call_exception raise exception_cls( # type: ignore[misc] beartype.roar.BeartypeCallHintParamViolation: @beartyped export() parameter training=False violates type hint , as False not instance of . ``` when `TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK` is not set and `beartype` is installed, a warning message is emitted. ``` >>> torch.onnx.export("foo", "bar", "f") :1: CallHintViolationWarning: Traceback (most recent call last): File "/home/justinchu/dev/pytorch/torch/onnx/_internal/_beartype.py", line 54, in _coerce_beartype_exceptions_to_warnings return beartyped(*args, **kwargs) File "<@beartype(torch.onnx.utils.export) at 0x7f1d4ab35280>", line 39, in export File "/home/justinchu/anaconda3/envs/pytorch/lib/python3.9/site-packages/beartype/_decor/_error/errormain.py", line 301, in raise_pep_call_exception raise exception_cls( # type: ignore[misc] beartype.roar.BeartypeCallHintParamViolation: @beartyped export() parameter model='foo' violates type hint typing.Union[torch.nn.modules.module.Module, torch.jit._script.ScriptModule, torch.jit.ScriptFunction], as 'foo' not , , or . 
Traceback (most recent call last): File "", line 1, in File "/home/justinchu/dev/pytorch/torch/onnx/_internal/_beartype.py", line 63, in _coerce_beartype_exceptions_to_warnings return func(*args, **kwargs) File "/home/justinchu/dev/pytorch/torch/onnx/utils.py", line 482, in export _export( File "/home/justinchu/dev/pytorch/torch/onnx/utils.py", line 1422, in _export with exporter_context(model, training, verbose): File "/home/justinchu/anaconda3/envs/pytorch/lib/python3.9/contextlib.py", line 119, in __enter__ return next(self.gen) File "/home/justinchu/dev/pytorch/torch/onnx/utils.py", line 177, in exporter_context with select_model_mode_for_export( File "/home/justinchu/anaconda3/envs/pytorch/lib/python3.9/contextlib.py", line 119, in __enter__ return next(self.gen) File "/home/justinchu/dev/pytorch/torch/onnx/utils.py", line 95, in select_model_mode_for_export originally_training = model.training AttributeError: 'str' object has no attribute 'training' ``` We see the error is caught right when the type mismatch happens, improving from what otherwise would become `AttributeError: 'str' object has no attribute 'training'` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83673 Approved by: https://github.com/BowenBao * example program for paper intro (#83945) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83945 Approved by: https://github.com/jansel * New TORCH_UCC_BLOCKING_WAIT env variable (#81791) Cherry-pick of https://github.com/facebookresearch/torch_ucc/pull/95. I recommend waiting until https://github.com/pytorch/pytorch/pull/81583 is merged first, so the CI is checking if this PR compiles correctly. Marking this as a draft for now, will change to "ready for review" once https://github.com/pytorch/pytorch/pull/81583 merged. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81791 Approved by: https://github.com/kwen2501 * Make graph_module.print_readable() discoverable (#83960) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83960 Approved by: https://github.com/ezyang * Fix FSDP not all outputs used in loss (#83195) There are a couple issues / assumptions within FSDP today that this PR attempts to fix: - In wait_for_post_backward, we assume that if a param required grad, its post backward was called, but this is not true, i.e. if its output did not participate in grad computation, it would not have called post backward. To fix this we simply removed those assertions. - There is a deeper issue where in `_finalize_params`, we could end up assigning a grad of the sharded shape to an unsharded parameter gradient field, which would raise a shape error. This can happen for example if a parameter's usage transitions from used --> unused. In this case, when the parameter was used, it would have had a gradient, then user could have possibly called `zero_grad()` and p.grad would not be `None`. This in `_prep_grad_for_backward`, we would assign a `_saved_grad_shard` to this gradient field which would be the sharded shape. In `_finalize_param`, our parameter would be unsharded (since post_backward was not called), but we'd try to assign, raising the shape issue. This issue is fixed by checking `_post_backward_called`. If this is False, we simply skip the assignment because there is no new gradient to update. - A final issue as mentioned above is that if post_backward is not called, we never reshard the full param. 
This is fixed by checking if we haven't resharded (basically if post_backward_called == False), and if so, performing a reshard. A few things to note:
- This logic may have to be revisited when non-recursive wrapping lands, as there are multiple FlatParams per FSDP unit
- This logic may not work when post_backward_hook fires but p.grad is None, i.e. the short-circuiting here: https://github.com/pytorch/pytorch/blob/f534b2c627da65bbee7ccc8f7e054da0ba48eb79/torch/distributed/fsdp/fully_sharded_data_parallel.py#L2884. As a quick fix, we could just move the `_post_backward_called` flag change to after this, or just perform a reshard before returning early. I am not sure how to repro a case where p.grad == None but we call the post-backward hook; https://github.com/pytorch/pytorch/issues/83197 might be a possibility, but I think it is fine to not support this yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83195 Approved by: https://github.com/awgu

* Silence namedtuple warning in dist (#84072)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84072 Approved by: https://github.com/awgu

* Don't introduce new overload for SymInt (#83628)

Previously, we introduced new SymInt overloads for every function we wanted. This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented. This PR takes a simpler but more risky approach: just take the original function and change its ints to SymInts.

This is BC-breaking in the following ways:
* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code-generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914. Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:
* The user-facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., `at::empty(IntArrayRef, ...)`). To call with SymInts, you must call at::empty_symint instead (a minimal sketch of this calling convention appears at the end of this description). This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types); as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:
* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted.
Here are some of the major places where we pick one or the other: * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular: * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences. * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!) * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway. * Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other caes. * The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK. * I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it. * I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload) * I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.) * I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints. * I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading. 
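As a minimal sketch of the calling-convention point above (my own illustration, not code from this PR; it only assumes the `at::empty` / `at::empty_symint` pair described earlier):
```C++
#include <ATen/ATen.h>

// The default C++ binding is unchanged and still takes concrete ints.
at::Tensor concrete_empty() {
  return at::empty({16}, at::kFloat);
}

// Symbolic sizes go through the explicit *_symint entry point instead.
at::Tensor symbolic_empty(c10::SymInt n) {
  return at::empty_symint({n}, at::kFloat);
}
```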
Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628 Approved by: https://github.com/albanD, https://github.com/bdhirsh * Fix missing include for size_t (#84088) Fixes the following issue: ```C++ In file included from /home/gaoxiang/pytorch-ucc/c10/test/util/ConstexprCrc_test.cpp:1: In file included from /home/gaoxiang/pytorch-ucc/c10/util/ConstexprCrc.h:3: /home/gaoxiang/pytorch-ucc/c10/util/IdWrapper.h:42:10: error: unknown type name 'size_t'; did you mean 'std::size_t'? friend size_t hash_value(const concrete_type& v) { ^~~~~~ std::size_t /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/12.2.0/../../../../include/c++/12.2.0/x86_64-pc-linux-gnu/bits/c++config.h:298:26: note: 'std::size_t' declared here typedef __SIZE_TYPE__ size_t; ^ 1 error generated. [111/2069] Generating /home/gaoxiang/pytorch-ucc/torch/csrc/a...ch-ucc/torch/testing/_internal/generated/annotated_fn_args.py ninja: build stopped: subcommand failed. ``` This error happens with my GCC 12.2.0 + Clang 14.0.6. Full environment: ``` Collecting environment information... PyTorch version: 1.13.0a0+git14a53e6 Is debug build: True CUDA used to build PyTorch: 11.7 ROCM used to build PyTorch: N/A OS: Arch Linux (x86_64) GCC version: (GCC) 12.2.0 Clang version: 14.0.6 CMake version: version 3.24.1 Libc version: glibc-2.36 Python version: 3.10.6 (main, Aug 3 2022, 17:39:45) [GCC 12.1.1 20220730] (64-bit runtime) Python platform: Linux-5.19.3-arch1-1-x86_64-with-glibc2.36 Is CUDA available: True CUDA runtime version: 11.7.99 GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 GPU 1: NVIDIA GeForce RTX 2080 Ti Nvidia driver version: 515.65.01 cuDNN version: Probably one of the following: /usr/lib/libcudnn.so.8.4.1 /usr/lib/libcudnn_adv_infer.so.8.4.1 /usr/lib/libcudnn_adv_train.so.8.4.1 /usr/lib/libcudnn_cnn_infer.so.8.4.1 /usr/lib/libcudnn_cnn_train.so.8.4.1 /usr/lib/libcudnn_ops_infer.so.8.4.1 /usr/lib/libcudnn_ops_train.so.8.4.1 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True Versions of relevant libraries: [pip3] numpy==1.23.1 [pip3] torch==1.13.0a0+gitbcc6f6c [pip3] torch-ucc==1.0.0 [pip3] torchani==2.2 [pip3] torchvision==0.2.2.post3 [conda] Could not collect ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84088 Approved by: https://github.com/ezyang * Fix small typo in cuda.rst (#84012) This fixes a very minor typo in the CUDA semantics doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84012 Approved by: https://github.com/malfet * Use size to check same tensor sizes in reduce_scatter and allgather (#84099) Summary: Previous code uses tensor.numel() to check if all tensors have the same size in order to switch between reduce_scatter_v v.s. reduce_scatter, same applies to allgather. However, if the user input tensor is zero in the last dimension (e.g., [648632,0]), then numel() returns zero and check_same_numel is always true. This patch fixes the check to use size rather than numel, to cover the above case. Differential Revision: D39044439 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84099 Approved by: https://github.com/kwen2501 * Separate kernel compilation API from kernel execution API (#1914) 1. Mostly mechanical changes to refactor some of KernelArgumentHolder in our stack instead of direct use of at::Tensor/IValue: Note: we are still holding a ref counted at::Tensor within kernel arg holder for tensor entries, simply because we want to forward it in case of aliased output. 
This is quite unsatisfying. But to properly strip framework Tensor from the codegen stack, we need quite some refactoring to abstract away the ownership of memory and allocator. That's for some future PRs.

2. Separate compilation from execution of kernels, currently using FusionExecutorCache::compileFusion and FusionExecutorCache::runFusionWithInputs. Note that the compilation API is still experimental. We currently kick off compilation into a separate thread. This part would need to be exposed & integrated into our python API.

TODO for follow-up PRs:
- trivial forwarding of input to outputs
- infer outputs should switch from meta tensor to fake tensor in order to preserve device
- segmented fusion should/could be compiled in parallel, since we can infer outputs without a compiled kernel.
- inputs_id_lookup should be refactored into KernelArgumentHolder, since we currently use args for passing inputs around.
- index mode is currently per fusion, which is not necessary and could be refactored to be per segmented fusion instead.
- bind kernel inputs should also try to bind cpu scalar with int type, since the runtime value can also be used in shape inference. Generally speaking, cpu scalar dtype should also be checked during validation.
- high water mark could be refactored into using the occupancy API after compilation, so we do not unnecessarily recompile when we don't have to.

* Use an unused variable (#84073)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84073 Approved by: https://github.com/huydhn

* Remove unreachable except block (#84070)

This was introduced because two PRs tried to fix an issue concurrently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84070 Approved by: https://github.com/huydhn, https://github.com/janeyx99

* Upstream cherry pick fixes 0811 (#1934)

cherry-pick upstream CI fixes from #83067 & #83239

* [xla hash update] update the pinned xla hash (#84043)

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84043 Approved by: https://github.com/pytorchbot

* Made some minor cleanups to decompositions (#83814)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83814 Approved by: https://github.com/ngimel

* Fix preconditions of adaptive_avg_pooling2d (#84061)

Before, if the input had dimension `4`, the channel had to be of non-zero dimension. This was not what the errors advertised.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84061 Approved by: https://github.com/Chillee

* [composite compliance] cov, corrcoef (#82954)

Ref: #69991

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82954 Approved by: https://github.com/zou3519

* Enable -Wunused-local-typedefs (#83708)

I recently had a PR reverted because it triggered an unused-local-typedefs warning, so disabling these in the CMake build is counter-productive.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83708 Approved by: https://github.com/albanD

* Use C10_HAS_CPP_ATTRIBUTE to simplify nodiscard definition (#83976)

`C10_HAS_CPP_ATTRIBUTE` only expands to `__has_cpp_attribute` when it is defined, so we avoid the extra `#if defined(__has_cpp_attribute)` checks and double-nested `#if`s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83976 Approved by: https://github.com/albanD

* [functorch] add lstsq batch rule (#82325)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82325 Approved by: https://github.com/zou3519

* do not use deprecated functions (#1935)

* Map IterationDomains through view operations. (#1919)

* mac circleci workflows (#82780)

Add mac and ios workflows to circleci so they can be run on pull. m1 tests are not included because circleci doesn't have machines. Unsure how to get certain environment variables (specifically for arm64 ios builds that require env vars like `IOS_SIGN_KEY_2022` and `IOS_DEV_TEAM_ID`, which are stored in the org-member context that is not accessible by everyone). Doc regarding env vars: https://docs.google.com/document/d/1J_3Z9sfu2vlHMF1fjdJfeTuxPXC6dgqJs7aU0KpYSBU/edit#

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82780 Approved by: https://github.com/malfet, https://github.com/huydhn

* Add type hints to torch.save, torch.load (#83937)

I'll probably need help with this one. I'm not sure what the full type signature for `map_location` should be.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83937 Approved by: https://github.com/malfet, https://github.com/albanD

* Expose ProcessGroup::Work.wait() API to TorchScript (#83303)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83303 Approved by: https://github.com/rohan-varma

* Update proxy_tensor.py to support List input/output (#83302)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83302 Approved by: https://github.com/Chillee

* Make allreduce compatible with fx ProxyTensor (#84126)

land after #83122

This PR explores solutions for 2 issues:
1. Collective comm ops are inplace ops and do not return a tensor. With that, `make_fx` cannot include comm ops in the traced graph. The current solution is to make comm ops return a tuple of `(output_tensors, work_handle)`, so that [`proxy_call`](https://github.com/pytorch/pytorch/blob/90821aab100a436424113e2306eac63f5e247ee5/torch/fx/experimental/proxy_tensor.py#L170-L172) can handle that. It won't change the behavior of existing c10d Python/C++ APIs, so I directly added the code to `Ops.cpp`.
2. `make_fx` does not recognize `ProcessGroup::Work` and will ignore the `wait()` call on the work when tracing the graph. However, this might break correctness, as when running the traced function, it could consume a tensor before it's ready. The current solution is to create a `CommTensor` tensor subclass to explicitly call `wait()`. In this PR, I am only doing this in the test, as we will need more discussion to see if we can add this to c10d Python implementations.

kudos to @Chillee @wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84126 Approved by: https://github.com/wanchaol

* Propagate permissive mapping information into indexing pass (#1929)

* [ONNX] Clean up patch functions (#83136)

Changes:
- Move namespace handling from `_new_node` to `_graph_op` for clarity
- Always require the `aten` namespace when creating aten ops.
Remove the `aten` argument supplied in `_aten_op` for clarity - Rename the `_ATTR_PATTERN` global - Improve types - Update `_add_attribute` to raise ValueErrors Pull Request resolved: https://github.com/pytorch/pytorch/pull/83136 Approved by: https://github.com/BowenBao * [Profiler][Minor] Extend Python bindings (#83622) Adding some fields which are needed for memory profiling. Differential Revision: [D38528382](https://our.internmc.facebook.com/intern/diff/D38528382/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83622 Approved by: https://github.com/Gamrix * [Profiler][Trivial] Add null handling to `AppendOnlyList::copy` memcpy path. (#83963) It is apparently undefined behavior to do pointer arithmetic on nullptr. In the case of AppendOnlyList, `next_` will only be null if `end_` is also null and thus the `memcpy` path will only be triggered if `n == 0`. Nonetheless, it is UB to `memcpy(0, 0, 0)` The extra null check is in a `C10_LIKELY` block so the extra cost should be negligible, and indeed after dusting off the component microbenchmarks there's no observable difference. Differential Revision: [D38969443](https://our.internmc.facebook.com/intern/diff/D38969443/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83963 Approved by: https://github.com/slgong-fb * Update Dynamo pin (#83829) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/83829 Approved by: https://github.com/ezyang * make job pass even if monitoring script fails (#84068) makes github slightly less confusing to look at when a test fails Pull Request resolved: https://github.com/pytorch/pytorch/pull/84068 Approved by: https://github.com/huydhn, https://github.com/malfet * [ONNX] Export node and value with scope name (#82040) Introduce `_jit_pass_onnx_assign_node_and_value_names` to parse and assign scoped name for nodes and values in exported onnx graph. Module layer information is obtained from `ONNXScopeName` captured in `scope` attribute in nodes. For nodes, the processed onnx node name are stored in attribute `onnx_name`. For values, the processed onnx output name are stored as `debugName`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82040 Approved by: https://github.com/AllenTiTaiWang, https://github.com/justinchuby, https://github.com/abock * Add support to traverse all python collection objects (#84079) Fixes https://github.com/pytorch/data/issues/752 This PR makes `traverse` function supporting more collections data structures from Python. Please let me know if anyone has a better idea about how to elegantly check if the object is a collection then we can dive into this object to see wether there is any DataPipe wrapped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84079 Approved by: https://github.com/NivekT * Read via FileAdapter when loading files in torch if not flatbuffer (#84028) Summary: This will optimize memory usage at the small cost of loading time when loading mobile models restoring the behavior before D36926217 (https://github.com/pytorch/pytorch/commit/fed12ff680813c0fab7dba7232f6b4cd8b33b8d3). Test Plan: Signals Reviewed By: qihqi Differential Revision: D38998858 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84028 Approved by: https://github.com/qihqi, https://github.com/cccclai * Enable cache action for windows and other minor workflows (#84093) Following up on https://github.com/pytorch/pytorch/pull/84026, these are the rest of pip dependencies that I can find. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84093 Approved by: https://github.com/malfet

* [Nested Tensor] do not use at::cuda::getDefaultCUDAStream() (#84134)

Use at::cuda::getCurrentCUDAStream(), not getDefaultCUDAStream(). Otherwise, add/remove padding kernels won't sync with the current stream, resulting in flaky unit tests in test_nestedtensor.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84134 Approved by: https://github.com/drisspg

* Fix a bug (#1936)

A bit uncomfortable not using an initialization list to initialize value_, but can't think of any other way to work around the c10::variant deprecation problem.

* [fx][pass] Fix type of exception (#84094)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84094 Approved by: https://github.com/SherlockNoMad

* [Profiler][Trivial] Cleanup ExperimentalConfig (#83890)

I'm trying to limit how much is in headers to make it easier to read the API surface. In a similar vein, we can replace `hasOptions` with `operator bool` so it just does the right thing in the check.

Differential Revision: [D38917366](https://our.internmc.facebook.com/intern/diff/D38917366/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83890 Approved by: https://github.com/slgong-fb

* [Profiler] Add `disabled` and `global` methods to ProfilerConfig. (#83891)

`ProfilerState::Disabled` and `ProfilerState::KINETO_ONDEMAND` have special semantics. The former is somewhat intuitive, but the degree of behavior branching on the latter (and why the branching is necessary) is less clear. By factoring the enum checks into methods, we can both clarify intent and future-proof in case we ever add other global profiling contexts.

Differential Revision: [D38917980](https://our.internmc.facebook.com/intern/diff/D38917980/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83891 Approved by: https://github.com/slgong-fb

* [DataPipe] Convert MapDataPipe.shuffle to IterDataPipe (#83202)

Fixes: https://github.com/pytorch/data/issues/718

This is an alternative PR against https://github.com/pytorch/pytorch/pull/82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterator has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:

This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to an `IterDataPipe`.

### 1.12 Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR: Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83202 Approved by: https://github.com/NivekT

* [Prim] Implement group_norm_backward (#84037)

Test plan: CI, i.e.
`python3 test_decomp.py -v -k test_comprehensive_nn_functional_group_norm` plus:
```
#!/usr/bin/env python3.8
import torch

func = torch.ops.aten.native_group_norm_backward.default
decomp = torch._decomp.decomposition_table[func]
for args in (
        (torch.rand(1, 6, 3), torch.rand(1, 6, 3), torch.rand(1, 2), torch.rand(1, 2), torch.rand(6), 1, 6, 3, 2, [True, True, True]),
        (torch.rand(64, 768, 7, 7), torch.rand(64, 768, 7, 7), torch.rand(64, 1), torch.rand(64, 1), torch.rand(768), 64, 768, 49, 1, [True, True, True])):
    nrc = func(*args)
    drc = decomp(*args)
    for i in range(len(nrc)):
        print(i, torch.max(nrc[i] - drc[i]))
    print(all(torch.allclose(x, y) for (x, y) in zip(nrc, drc)))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84037 Approved by: https://github.com/Chillee, https://github.com/ngimel

* Revert "[xla hash update] update the pinned xla hash (#84043)"

This reverts commit ddedc294fbb4c13170811442b590a18e950dae67.

Reverted https://github.com/pytorch/pytorch/pull/84043 on behalf of https://github.com/malfet due to Depends on https://github.com/pytorch/pytorch/pull/83628

* Revert "Don't introduce new overload for SymInt (#83628)"

This reverts commit 9790d90e4b0288796ab44a6b4979db0a67580ba8.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to Breaks internal builds, see D39076487

* [AOT Autograd] Redirect named_parameters to original mod (#84157)

Helps in comparing accuracy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84157 Approved by: https://github.com/Chillee

* [Nested Tensor] detach (#84078)

## Summary

Add detach op for nested tensors. Nested tensors are not part of the composite explicit dispatch key set and therefore need to be added manually.

The Detach test is failing only for dtype=torch.float32, torch.float16 and device=cuda. The chain of ops called is sum.backward() -> from_padded() -> unbind(). This populates the grad for a and b. Does this potentially indicate that the cuda implementation for one of these ops, likely from_padded(), is incorrect?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84078 Approved by: https://github.com/albanD

* Enforce explicit ProcessGroup passed into DefaultState (#84105)

Would prefer to enforce that users pass an explicit PG into these state objects when using comm hooks with FSDP, so that it is clear and easily debuggable which processes communication is taking place over.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84105 Approved by: https://github.com/mrshenli, https://github.com/zhaojuanmao

* _to_copy decomp (#84108)

Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84108 Approved by: https://github.com/Chillee

* [ONNX] Fix type annotations and enable type checking for all apis (#84091)

Enable runtime type checking for all torch.onnx public apis, symbolic functions and most helpers (minus two that do not have a checkable type: `_.JitType` does not exist) by adding the beartype decorator. Fix type annotations to make unit tests green.

Profile: export `torchvision.models.alexnet(pretrained=True)`
```
with runtime type checking: 21.314 / 10 passes
without runtime type checking: 20.797 / 10 passes
+ 2.48%
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84091 Approved by: https://github.com/BowenBao

* Add nvprims.var_mean (#83508)

This PR adds the nvfuser-specific primitive `var_mean`.
Interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by `TorchRefsNvfuserCapabilityMode` context manager. I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean"`). Layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception. Here's a simple comparison of performance with this PR and master (on 3080ti): ```py import torch from torch._prims.context import TorchRefsNvfuserCapabilityMode from torch.fx.experimental.proxy_tensor import make_fx from torch._prims.executor import execute def func(a): return torch.native_layer_norm(a, (1024,), None, None, 1e-6) a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda") with TorchRefsNvfuserCapabilityMode(): gm = make_fx(func)(a) for _ in range(10): execute(gm, a, executor="strictly_nvfuser"); ``` run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py` ```py # WITH THIS PR # kernel1 run in 0.032768 ms, achieved: 641.25 GB/s # kernel1 run in 0.033792 ms, achieved: 621.818 GB/s # kernel1 run in 0.032768 ms, achieved: 641.25 GB/s # kernel1 run in 0.032608 ms, achieved: 644.396 GB/s # kernel1 run in 0.031744 ms, achieved: 661.935 GB/s # kernel1 run in 0.031744 ms, achieved: 661.935 GB/s # kernel1 run in 0.032768 ms, achieved: 641.25 GB/s # kernel1 run in 0.03072 ms, achieved: 684 GB/s # kernel1 run in 0.031744 ms, achieved: 661.935 GB/s # kernel1 run in 0.031744 ms, achieved: 661.935 GB/s # ON MASTER # kernel1 run in 0.05632 ms, achieved: 373.091 GB/s # kernel1 run in 0.044032 ms, achieved: 477.209 GB/s # kernel1 run in 0.044032 ms, achieved: 477.209 GB/s # kernel1 run in 0.044032 ms, achieved: 477.209 GB/s # kernel1 run in 0.043808 ms, achieved: 479.649 GB/s # kernel1 run in 0.043008 ms, achieved: 488.571 GB/s # kernel1 run in 0.044032 ms, achieved: 477.209 GB/s # kernel1 run in 0.043008 ms, achieved: 488.571 GB/s # kernel1 run in 0.043008 ms, achieved: 488.571 GB/s # kernel1 run in 0.043008 ms, achieved: 488.571 GB/s ``` So this PR gives about 35% improvement in performance using nvfuser executor with this specific normalized shape. Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`). Ref. https://github.com/pytorch/pytorch/issues/80187 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508 Approved by: https://github.com/ngimel * [xla hash update] update the pinned xla hash (#84164) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84164 Approved by: https://github.com/pytorchbot * Revert "Make allreduce compatible with fx ProxyTensor (#84126)" This reverts commit ec5b83f76847584013a9cd4177d389a408033614. Reverted https://github.com/pytorch/pytorch/pull/84126 on behalf of https://github.com/malfet due to Likely broke multigpu periodic jobs, see https://github.com/pytorch/pytorch/runs/8044611438?check_suite_focus=true * Fix softmax bwd sizes. (#1890) * Test `rand` in a fusion with zero tensor input (#1932) * Improve trivial reduction merge support (#1931) * Double support on all expression evaluators (#1937) * arange support (#1933) * Replace assertEqualIgnoreTypes from common_methods_invocations.py (#84076) This addresses TODO:38095 . 
More details at https://github.com/pytorch/pytorch/issues/38095

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84076 Approved by: https://github.com/kit1980

* Nvfuser to copy decomp to prim (#83782)

Conditionally decompose aten::_to_copy to nvprim::convert_element_type to allow fusion with type casting, which is introduced during the type promotion phase of torch decomposition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83782 Approved by: https://github.com/ngimel

* Tensor factories must set the output shape as its input (#1939)

* Revert "[xla hash update] update the pinned xla hash (#84164)"

This reverts commit c032b097e315177af5bc867eeee5452b7df32952.

Reverted https://github.com/pytorch/pytorch/pull/84164 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally

* Revert "Add nvprims.var_mean (#83508)"

This reverts commit 7e7694b6615fbf46abfab234615fa891c2819eb7.

Reverted https://github.com/pytorch/pytorch/pull/83508 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally

* Revert "[ONNX] Fix type annotations and enable type checking for all apis (#84091)"

This reverts commit 6446da17305960088dfae501d5c7358af068fa81.

Reverted https://github.com/pytorch/pytorch/pull/84091 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally

* Fix arange when step is negative (#1942)

* The device version of ceilDiv assumes positive inputs, so when step is negative, it gives an incorrect result. For example, I see FusionStandAloneArange results in a write error with compute-sanitizer when start = 0, stop = -1, step = -1.5 and dtype = kLong.

* Add full, full_like, zeros, zeros_like, ones, ones_like (#1943)

* Move detection of self mapping IDs to IterDomainGraph from (#1941)

* test the groups the same order as they are merged (#1949)

* Exclude unsupported data types (#1951)

* Exclude unsupported data types

* Some indexing cleanups, Add eye support (#1940)

* Fix detection of unmappable root domains (#1952)

ComputeAtRootDomainMap flags domains that should not be mapped due to reductions. Previously, checking if a domain potentially causes an invalid mapping was only done with one domain in each group of domains that were found to be mappable so far. That's not actually sufficient, as the unmappable domain set is created just once with no root mapping information. The fix is to check all consumer domains of a producer tensor. Another small fix is also done to address a different problem discovered after the first fix.

* Fill allocation with nan on tests (#1956)

* TVDomainGuard factory (#1953)

* Some cleanup (#1957)

* Remove unused variables (#1955)

* Improve the comments at the beginning of index_compute.h (#1946)

I just started to learn indexing, and the comment at the beginning of index_compute.h does not look good...

* Allow splitting inner-most ID to create virtual innermost ID in transpose scheduler (#1930)

* WAR on index mapping when exact and permissive maps differ (#1960)

* Fix dump effective bandwidth (#1962)

* Upstream push ci fixes (#1965)

Cherry-picking upstream build failure patches from PR pytorch#84626

Changes include:
1. added throw in stringify
2. Split fused_reduction.cu as its size exceeds the limit in MSVC
3. update bzl build for runtime header
4. Fix a bug originally reported in https://github.com/pytorch/pytorch/pull/84626
5.
Meta internal build fix

Co-authored-by: Naoya Maruyama

* View scheduling (#1928)

* Move scheduler vectorize utilities into their own file (#1959)

* Enable transpose scheduler (#1927)

* Minor fix (#1967)

* Fix canScheduleCompileTime check of transpose scheduler (#1969)

* Add a null scheduler that helps segmenting away no-op schedules (#1835)

Co-authored-by: Gao, Xiang

* Enable Transpose operation (#1882)

* Segment self mapping fusions (#1954)

* Add support for some empty fusion (#1981)

* Remove non-const functions, remove GpuLower instance on build, pass in ca_map. (#1987)

* Add support for uniform RNG (#1986)

* Minor cleanup (#1992)

* Fix missing thread predicates

Unlikely to matter, but should be necessary

* Minor cleanup lower_unroll.cpp (#1994)

* Move ConcretizedBroadcastDomains to shared_ptr in GpuLower. (#1988)

* Cleanup of lower_utils.cpp: Isolate out GpuLower usage (#1989)

* Minor build fix. (#1996)

* Improve divisible split detection (#1970)

* cleanup (#1997)

* Just fixes comments (#1998)

* Just fixes comments

* Fix build problem (#1999)

* More strict validation (#2000)

* Test util cleanup (#2003)

Don't clear the memory allocator cache as it shouldn't be necessary

* fix merge

* fix merge

* format

* Make inlining even more modular (#2004)

I don't like `TensorView::setComputeAt` and `TensorView::setMaxProducer`; they are private, and I cannot use them conveniently. It would be better if there were some public method of `TensorView` that allows directly setting the CA position of a TV with the necessary validation. So I added two public methods, `TensorView::inlineAt` and `TensorView::updateMaxProducerPosition`, and removed `TensorView::setComputeAt` and `TensorView::setMaxProducer`.

The `inlineAt` can be safely used publicly. It will not inline into disallowed dimensions, and the max producer position will be kept consistent. There are two ways of using `inlineAt`:

If you only want to set the CA position of a single tensor, then simply do
```C++
tv->inlineAt(pos, /*best_effort=*/true);
```
If you want to set the CA position of multiple tensors, then you can do
```C++
MaxPosCalculator calc;
for (auto tv : tensors) {
  tv->inlineAt(pos, /*best_effort=*/true, &calc);
}
```
In both cases, the max producer position will be updated at the end of the `inlineAt` call. Manually constructing the `MaxPosCalculator` object is mainly for performance reasons: we don't want to build unmappable dimensions every time we call `inlineAt`. If we want to inline multiple tensors, we should build it at the beginning and use it in all `inlineAt` calls.

Even though `inlineAt` always updates the max producer position automatically, there are still cases where we want to manually trigger an update of the max producer position, and the `updateMaxProducerPosition` is designed for such a purpose. It is mainly used for grouped reductions.

**With `inlineAt`, I can refactor inlining to make it even more modular:** There is no longer an `InlinePropagator`. Innermost inlining is now just a dumb for loop:
```C++
MaxPosCalculator calc;
for (auto tv : all_tvs) {
  tv->inlineAt(-1, /*best_effort=*/true, &calc);
}
```
For standard and best effort inlining, we first need to do a propagation to find the positions in each tensor mapped to the given reference tensor's given position. With the positions calculated, inlining is again a dumb for loop.
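A rough sketch of that last two-step flow (not code from this PR; `computeInlinePositions`, `reference_tv`, `reference_pos`, and `all_tvs` are hypothetical names for the propagation helper and its inputs):
```C++
// Step 1: propagate from the reference tensor to find, for every TensorView,
// the position that maps to reference_tv's position reference_pos.
std::unordered_map<TensorView*, int64_t> mapped_pos =
    computeInlinePositions(reference_tv, reference_pos, all_tvs);

// Step 2: inlining is again just a loop of best-effort inlineAt calls,
// sharing one MaxPosCalculator so unmappable dimensions are built only once.
MaxPosCalculator calc;
for (auto tv : all_tvs) {
  tv->inlineAt(mapped_pos.at(tv), /*best_effort=*/true, &calc);
}
```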
* Contiguous indexing for View operations (#1990) * Enable tests previously disabled due to an aliasing bug (#2005) * Enable tests previously disabled due to an aliasing bug The bug was fixed by #1792 * Add matmul benchmark (#2007) Co-authored-by: Catherine Lee Co-authored-by: Rohan Varma Co-authored-by: CaoE Co-authored-by: Naoya Maruyama Co-authored-by: Justin Chu Co-authored-by: Jerry Zhang Co-authored-by: Kshiteej K Co-authored-by: PyTorch MergeBot Co-authored-by: jpvillam Co-authored-by: mattip Co-authored-by: Vasiliy Kuznetsov Co-authored-by: Horace He Co-authored-by: Jeff Daily Co-authored-by: Jane Xu Co-authored-by: Scott Wolchok Co-authored-by: samdow Co-authored-by: chengscott <60510scott@gmail.com> Co-authored-by: Brian Hirsh Co-authored-by: Nikita Shulga Co-authored-by: Ivan Yashchuk Co-authored-by: Ke Wen Co-authored-by: lezcano Co-authored-by: soulitzer Co-authored-by: Stephen Jia Co-authored-by: Mengwei Liu Co-authored-by: Kaichen Liu Co-authored-by: Khushi Agrawal Co-authored-by: chenlai Co-authored-by: Zain Rizvi Co-authored-by: Edward Z. Yang Co-authored-by: John Detloff Co-authored-by: Eli Uriegas Co-authored-by: Angela Yi Co-authored-by: Shirong Wu Co-authored-by: Sergii Dymchenko Co-authored-by: Nan Xiao Co-authored-by: Hansong Zhang Co-authored-by: Ishan-Rajgarhia Co-authored-by: Driss Guessous Co-authored-by: Peter Bell Co-authored-by: Lu, Chengjun Co-authored-by: Masaki Kozuki Co-authored-by: Souranil Sen Co-authored-by: Seonglyong Gong Co-authored-by: Henry Tu Co-authored-by: Antonio Kim Co-authored-by: BowenBao Co-authored-by: John Clow Co-authored-by: Robert Co-authored-by: Nikolay Korovaiko Co-authored-by: Digant Desai Co-authored-by: Richard Barnes Co-authored-by: Larry Liu <8188269+larryliu0820@users.noreply.github.com> Co-authored-by: Sherlock Huang Co-authored-by: thomasw21 <24695242+thomasw21@users.noreply.github.com> Co-authored-by: Huy Do Co-authored-by: Jagadish Krishnamoorthy Co-authored-by: Bin Chen Co-authored-by: Chen, Jian Ping Co-authored-by: ProGamerGov Co-authored-by: Weiwen Xia Co-authored-by: atalman Co-authored-by: jjsjann123 Co-authored-by: XiaobingSuper Co-authored-by: Andrew Gallagher Co-authored-by: Mandar Deshpande Co-authored-by: Alex Beloi Co-authored-by: Richard Zou Co-authored-by: erjia Co-authored-by: Animesh Jain Co-authored-by: Jianyu Huang Co-authored-by: zaf Co-authored-by: Michael Voznesensky Co-authored-by: migeedz Co-authored-by: Christian Jauvin Co-authored-by: Min Si Co-authored-by: Christian Sarofeen Co-authored-by: Adam J. Stewart Co-authored-by: Shen Li Co-authored-by: S. 
Song <41357537+shmsong@users.noreply.github.com> Co-authored-by: Taylor Robie Co-authored-by: Ian Graves Co-authored-by: Natalia Gimelshein Co-authored-by: Ivan Yashchuk Co-authored-by: kuttire42 <64169153+kuttire42@users.noreply.github.com> Co-authored-by: Naoya Maruyama Co-authored-by: Ryan Spring --- .circleci/cimodel/data/dimensions.py | 1 + .../cimodel/data/simple/ios_definitions.py | 25 +- .../cimodel/data/simple/macos_definitions.py | 105 +- .circleci/cimodel/data/simple/nightly_ios.py | 8 +- .../simple/upload_test_stats_definition.py | 20 + .../cimodel/data/simple/util/versions.py | 14 +- .circleci/config.yml | 294 +- .circleci/docker/build.sh | 10 +- .circleci/docker/common/install_base.sh | 3 +- .circleci/docker/common/install_conda.sh | 6 +- .circleci/docker/common/install_ucc.sh | 48 + .circleci/docker/requirements-ci.txt | 10 + .circleci/docker/ubuntu-cuda/Dockerfile | 11 + .circleci/docker/ubuntu/Dockerfile | 11 + .circleci/generate_config_yml.py | 10 + .../job-specs/job-specs-custom.yml | 205 +- .github/ISSUE_TEMPLATE/ci-sev.md | 2 + .github/PULL_REQUEST_TEMPLATE.md | 9 +- .../actions/get-workflow-job-id/action.yml | 2 +- .github/actions/setup-win/action.yml | 5 + .github/ci_commit_pins/torchdynamo.txt | 2 +- .github/ci_commit_pins/vision.txt | 2 +- .github/ci_commit_pins/xla.txt | 2 +- .github/generated-ciflow-ruleset.json | 5 - .github/merge_rules.json | 230 - .github/merge_rules.yaml | 342 + .github/requirements-gha-cache.txt | 16 + .github/scale-config.yml | 2 +- .github/scripts/comment_on_pr.py | 34 + .../scripts/generate_binary_build_matrix.py | 2 +- .github/scripts/get_workflow_job_id.py | 6 +- .github/scripts/install_nvidia_utils_linux.sh | 4 +- .github/scripts/lint_test_ownership.py | 88 - .github/scripts/test_trymerge.py | 21 +- .github/scripts/trymerge.py | 244 +- .github/scripts/trymerge_explainer.py | 146 + .github/scripts/update_commit_hashes.py | 1 + .github/templates/common.yml.j2 | 2 + .../linux_binary_build_workflow.yml.j2 | 2 +- .../windows_binary_build_workflow.yml.j2 | 4 +- .../workflows/_android-full-build-test.yml | 36 - .github/workflows/_binary-build-linux.yml | 5 +- .github/workflows/_binary-test-linux.yml | 4 +- .github/workflows/_binary-upload.yml | 61 +- .github/workflows/_buck-build-test.yml | 9 +- .github/workflows/_docs.yml | 16 +- .github/workflows/_ios-build-test.yml | 7 + .github/workflows/_linux-test.yml | 3 +- .github/workflows/_mac-build.yml | 8 +- ...{_mac-test-arm64.yml => _mac-test-mps.yml} | 2 +- .github/workflows/_mac-test.yml | 51 +- .github/workflows/_rocm-test.yml | 1 + .github/workflows/_win-test.yml | 1 + .../workflows/cancel_redundant_workflows.yml | 23 - .github/workflows/docker-release.yml | 89 + .github/workflows/lint.yml | 16 +- .github/workflows/mac-mps.yml | 35 + .github/workflows/periodic.yml | 52 + .github/workflows/pr-labels.yml | 12 +- .github/workflows/pull.yml | 312 + .../workflows/push_nightly_docker_ghcr.yml | 39 + .github/workflows/revert.yml | 26 +- .github/workflows/stale_pull_requests.yml | 42 - .github/workflows/trunk.yml | 35 +- .github/workflows/trymerge.yml | 26 +- .github/workflows/tryrebase.yml | 27 +- .github/workflows/update-viablestrict.yml | 16 +- .github/workflows/update_pytorch_labels.yml | 2 +- .github/workflows/update_s3_htmls.yml | 2 +- .gitmodules | 3 + .jenkins/caffe2/test.sh | 2 +- .jenkins/pytorch/build.sh | 10 + .jenkins/pytorch/common_utils.sh | 2 + .jenkins/pytorch/macos-build.sh | 4 +- .jenkins/pytorch/macos-common.sh | 41 +- .jenkins/pytorch/macos-test.sh | 27 +- 
.jenkins/pytorch/multigpu-test.sh | 4 +- .jenkins/pytorch/test.sh | 9 +- .../win-test-helpers/build_pytorch.bat | 4 +- ...miniconda3.bat => activate_miniconda3.bat} | 16 +- .../win-test-helpers/setup_pytorch_env.bat | 8 +- .lintrunner.toml | 4 + BUILD.bazel | 1 + CMakeLists.txt | 103 +- CODEOWNERS | 14 +- CONTRIBUTING.md | 41 +- Dockerfile | 26 +- WORKSPACE | 5 +- aten/CMakeLists.txt | 2 +- aten/src/ATen/BatchingRegistrations.cpp | 5 - aten/src/ATen/Context.h | 4 + aten/src/ATen/DLConvertor.cpp | 19 +- aten/src/ATen/Dispatch.h | 16 +- aten/src/ATen/EmptyTensor.cpp | 8 + aten/src/ATen/ExpandUtils.h | 4 +- aten/src/ATen/FunctionalStorageImpl.cpp | 6 +- aten/src/ATen/NestedTensorImpl.cpp | 149 +- aten/src/ATen/NestedTensorImpl.h | 92 +- aten/src/ATen/Parallel.h | 1 + aten/src/ATen/SparseCsrTensorImpl.cpp | 3 + aten/src/ATen/SparseCsrTensorImpl.h | 1 + aten/src/ATen/TensorIterator.h | 4 + aten/src/ATen/TensorMeta.h | 1 + aten/src/ATen/TensorSubclassLikeUtils.h | 19 + aten/src/ATen/ThreadLocalState.cpp | 4 +- aten/src/ATen/ThreadLocalState.h | 2 +- aten/src/ATen/Utils.h | 53 - aten/src/ATen/autocast_mode.cpp | 15 +- aten/src/ATen/core/List.h | 17 +- aten/src/ATen/core/List_inl.h | 10 +- aten/src/ATen/core/NamedRegistrations.cpp | 1 - aten/src/ATen/core/PhiloxRNGEngine.h | 131 +- aten/src/ATen/core/PythonFallbackKernel.cpp | 4 +- aten/src/ATen/core/PythonFallbackKernel.h | 2 +- aten/src/ATen/core/TensorBase.h | 16 + aten/src/ATen/core/TorchDispatchModeTLS.cpp | 58 - aten/src/ATen/core/TorchDispatchUtils.cpp | 31 + ...DispatchModeTLS.h => TorchDispatchUtils.h} | 14 +- .../ATen/core/dispatch/DispatchKeyExtractor.h | 5 +- aten/src/ATen/core/dispatch/Dispatcher.h | 20 + aten/src/ATen/core/dispatch/OperatorEntry.cpp | 22 +- aten/src/ATen/core/dispatch/OperatorEntry.h | 6 + aten/src/ATen/core/function_schema.h | 15 +- aten/src/ATen/core/interned_strings.cpp | 1 + aten/src/ATen/core/interned_strings.h | 14 +- aten/src/ATen/core/ivalue_inl.h | 1 + aten/src/ATen/core/symbol.h | 1 + aten/src/ATen/cpu/vec/vec256/vec256_qint.h | 94 +- .../cpu/vec/vec256/vsx/vec256_float_vsx.h | 22 +- aten/src/ATen/cpu/vec/vec512/vec512_qint.h | 72 + aten/src/ATen/cpu/vec/vec_base.h | 3 +- aten/src/ATen/cuda/CUDABlas.cpp | 63 +- aten/src/ATen/cuda/CUDABlas.h | 18 +- aten/src/ATen/cuda/CUDAEvent.h | 23 + aten/src/ATen/cuda/CUDASparse.h | 15 +- aten/src/ATen/cuda/CUDASparseDescriptors.cpp | 8 +- aten/src/ATen/cuda/CUDASparseDescriptors.h | 15 +- aten/src/ATen/cuda/jiterator.h | 2 +- aten/src/ATen/cuda/jiterator_impl.h | 30 +- aten/src/ATen/cuda/llvm_complex.cpp | 6 +- aten/src/ATen/jit_macros.h | 7 - aten/src/ATen/jiterator_macros.h | 4 +- aten/src/ATen/mps/IndexKernels.h | 132 + aten/src/ATen/mps/MPSAllocator.mm | 4 +- aten/src/ATen/mps/MPSDevice.h | 9 + aten/src/ATen/mps/MPSDevice.mm | 43 +- aten/src/ATen/mps/MPSFallback.mm | 4 - aten/src/ATen/mps/MPSGuardImpl.mm | 3 - .../ATen/native/AdaptiveAveragePooling.cpp | 10 +- aten/src/ATen/native/BatchLinearAlgebra.cpp | 262 +- aten/src/ATen/native/Convolution.cpp | 27 +- aten/src/ATen/native/Correlation.cpp | 9 +- aten/src/ATen/native/Cross.cpp | 24 +- aten/src/ATen/native/DispatchStub.h | 1 + aten/src/ATen/native/Dropout.cpp | 7 +- aten/src/ATen/native/ForeachOpsKernels.cpp | 10 + aten/src/ATen/native/Integration.cpp | 6 +- aten/src/ATen/native/Linear.cpp | 3 - aten/src/ATen/native/LinearAlgebra.cpp | 42 +- aten/src/ATen/native/MaxPooling.cpp | 8 +- aten/src/ATen/native/Normalization.cpp | 9 +- aten/src/ATen/native/Onehot.cpp | 4 +- aten/src/ATen/native/Pool.h | 5 + 
aten/src/ATen/native/README.md | 22 + aten/src/ATen/native/RangeFactories.cpp | 4 + aten/src/ATen/native/ReduceAllOps.cpp | 15 + aten/src/ATen/native/ReduceOps.cpp | 74 +- aten/src/ATen/native/ReduceOpsUtils.h | 2 +- aten/src/ATen/native/SoftMax.cpp | 17 +- aten/src/ATen/native/Sorting.cpp | 6 +- aten/src/ATen/native/SpectralOps.cpp | 7 +- .../ATen/native/TensorAdvancedIndexing.cpp | 6 +- aten/src/ATen/native/TensorConversions.cpp | 12 +- aten/src/ATen/native/TensorFactories.cpp | 48 +- aten/src/ATen/native/TensorShape.cpp | 13 +- aten/src/ATen/native/TestOps.cpp | 29 + aten/src/ATen/native/UpSample.h | 16 +- .../ao_sparse/quantized/cpu/qnnpack_utils.h | 2 +- aten/src/ATen/native/cpu/CopyKernel.cpp | 35 +- aten/src/ATen/native/cpu/CopyKernel.h | 12 + aten/src/ATen/native/cpu/Loops.h | 9 - aten/src/ATen/native/cpu/UnaryOpsKernel.cpp | 20 +- aten/src/ATen/native/cpu/UpSampleKernel.cpp | 28 +- .../native/cuda/AdaptiveAveragePooling.cu | 11 +- aten/src/ATen/native/cuda/Blas.cpp | 4 +- aten/src/ATen/native/cuda/Copy.cu | 2 - aten/src/ATen/native/cuda/Copy.h | 10 + aten/src/ATen/native/cuda/CumminmaxKernel.cu | 29 + aten/src/ATen/native/cuda/CumprodKernel.cu | 23 + aten/src/ATen/native/cuda/CumsumKernel.cu | 25 + aten/src/ATen/native/cuda/DistanceKernel.cu | 138 +- aten/src/ATen/native/cuda/EmbeddingBag.cu | 11 +- .../ATen/native/cuda/ForeachPointwiseOp.cu | 31 +- .../ATen/native/cuda/FractionalMaxPool2d.cu | 14 +- aten/src/ATen/native/cuda/Indexing.cu | 5 + aten/src/ATen/native/cuda/JitLoops.cuh | 4 - .../ATen/native/cuda/LinearAlgebraStubs.cpp | 9 +- .../ATen/native/cuda/LogcumsumexpKernel.cu | 37 + aten/src/ATen/native/cuda/Loss.cu | 2 + aten/src/ATen/native/cuda/NLLLoss2d.cu | 12 +- aten/src/ATen/native/cuda/Normalization.cuh | 83 +- .../ATen/native/cuda/PersistentSoftmax.cuh | 8 +- aten/src/ATen/native/cuda/ScanKernels.cpp | 5 + .../cuda/{ScanKernels.cu => ScanUtils.cuh} | 89 +- aten/src/ATen/native/cuda/SoftMax.cu | 12 +- aten/src/ATen/native/cuda/TensorFactories.cu | 8 +- aten/src/ATen/native/cuda/TensorTopK.cu | 14 +- .../ATen/native/cuda/UnaryComplexKernels.cu | 39 +- .../ATen/native/cuda/UnarySpecialOpsKernel.cu | 9 +- aten/src/ATen/native/cuda/block_reduce.cuh | 43 +- aten/src/ATen/native/cuda/jit_utils.cpp | 258 +- aten/src/ATen/native/cuda/jit_utils.h | 1 - .../src/ATen/native/cuda/layer_norm_kernel.cu | 2 +- .../native/cuda/linalg/BatchLinearAlgebra.cpp | 323 +- .../cuda/linalg/BatchLinearAlgebraLib.cpp | 95 - .../cuda/linalg/BatchLinearAlgebraLib.h | 5 - .../ATen/native/cuda/reduction_template.cuh | 16 + aten/src/ATen/native/cudnn/Conv_v7.cpp | 8 +- aten/src/ATen/native/cudnn/Conv_v8.cpp | 16 +- aten/src/ATen/native/layer_norm.cpp | 12 +- aten/src/ATen/native/miopen/Conv_miopen.cpp | 227 + aten/src/ATen/native/mkldnn/Common.h | 46 + aten/src/ATen/native/mkldnn/Conv.cpp | 89 +- aten/src/ATen/native/mkldnn/ConvPrepack.cpp | 289 + aten/src/ATen/native/mkldnn/ConvPrepack.h | 49 + aten/src/ATen/native/mkldnn/Matmul.cpp | 14 +- aten/src/ATen/native/mkldnn/OpContext.cpp | 47 + aten/src/ATen/native/mkldnn/OpContext.h | 99 + aten/src/ATen/native/mkldnn/Pooling.cpp | 18 +- .../mkldnn/RegisterMkldnnOpContextClass.cpp | 60 + aten/src/ATen/native/mps/OperationUtils.mm | 7 +- .../ATen/native/mps/operations/Activation.mm | 39 +- .../ATen/native/mps/operations/BinaryOps.mm | 1 + .../ATen/native/mps/operations/BitwiseOps.mm | 336 + .../ATen/native/mps/operations/ConstantOps.mm | 13 +- .../ATen/native/mps/operations/Convolution.mm | 68 +- .../native/mps/operations/Distributions.mm | 17 - 
.../src/ATen/native/mps/operations/Indexing.h | 51 + .../ATen/native/mps/operations/Indexing.mm | 145 +- aten/src/ATen/native/mps/operations/Linear.mm | 9 +- .../native/mps/operations/LinearAlgebra.mm | 65 +- .../src/ATen/native/mps/operations/LossOps.mm | 6 - aten/src/ATen/native/mps/operations/Pad.mm | 304 + .../native/mps/operations/PointwiseOps.mm | 5 + .../src/ATen/native/mps/operations/Pooling.mm | 16 - .../ATen/native/mps/operations/ReduceOps.mm | 1 - aten/src/ATen/native/mps/operations/Repeat.mm | 13 +- aten/src/ATen/native/mps/operations/RnnOps.mm | 3 - .../native/mps/operations/ScatterGather.mm | 49 +- aten/src/ATen/native/mps/operations/Shape.mm | 286 - .../src/ATen/native/mps/operations/SoftMax.mm | 2 - .../native/mps/operations/TensorCompare.mm | 6 - .../native/mps/operations/TriangularOps.mm | 14 +- aten/src/ATen/native/mps/operations/View.mm | 26 +- aten/src/ATen/native/native_functions.yaml | 654 +- .../native/nested/NestedTensorBackward.cpp | 106 + .../ATen/native/nested/NestedTensorMath.cpp | 99 +- .../src/ATen/native/nested/NestedTensorMath.h | 53 +- .../NestedTensorTransformerFunctions.cpp | 25 +- .../nested/NestedTensorTransformerFunctions.h | 2 - .../cuda/NestedTensorTransformerFunctions.cpp | 1 + .../cuda/NestedTensorTransformerFunctions.cu | 6 +- aten/src/ATen/native/quantized/README.md | 3 +- .../ATen/native/quantized/cpu/QuantUtils.h | 20 + .../native/quantized/cpu/conv_serialization.h | 3 + .../cpu/kernels/QuantizedOpKernels.cpp | 51 +- aten/src/ATen/native/quantized/cpu/qconv.cpp | 27 +- .../quantized/cpu/qembeddingbag_prepack.cpp | 17 +- .../src/ATen/native/quantized/cpu/qlinear.cpp | 17 +- .../native/quantized/cpu/qlinear_dynamic.cpp | 23 +- .../fully-connected-sparse-operator-tester.h | 2 +- .../gemm-block-sparse-microkernel-tester.h | 2 +- .../ATen/native/sparse/SparseCsrTensor.cpp | 69 +- .../ATen/native/sparse/SparseTensorMath.cpp | 281 +- .../src/ATen/native/sparse/SparseTensorMath.h | 1 + aten/src/ATen/native/sparse/cuda/SoftMax.cu | 14 +- .../native/sparse/cuda/SparseBlasImpl.cpp | 16 +- .../native/sparse/cuda/SparseCUDABlas.cpp | 10 +- .../sparse/cuda/SparseCUDATensorMath.cu | 16 +- .../ATen/native/sparse/cuda/SparseMatMul.cu | 6 +- aten/src/ATen/native/tags.yaml | 6 + .../ATen/native/transformers/attention.cpp | 116 +- .../native/transformers/cuda/attention.cu | 10 +- .../ATen/native/transformers/transformer.cpp | 7 +- aten/src/ATen/native/ts_native_functions.yaml | 1 - aten/src/ATen/native/vulkan/api/Allocator.h | 2 +- aten/src/ATen/native/vulkan/api/Command.cpp | 76 + aten/src/ATen/native/vulkan/api/Command.h | 16 +- aten/src/ATen/native/vulkan/api/Common.h | 45 +- aten/src/ATen/native/vulkan/api/Context.cpp | 60 +- aten/src/ATen/native/vulkan/api/Context.h | 127 +- aten/src/ATen/native/vulkan/api/Resource.cpp | 50 +- aten/src/ATen/native/vulkan/api/Resource.h | 18 + aten/src/ATen/native/vulkan/api/Runtime.cpp | 11 + .../src/ATen/native/vulkan/api/vk_mem_alloc.h | 19558 ------------- aten/src/ATen/native/vulkan/glsl/add.glsl | 13 +- aten/src/ATen/native/vulkan/glsl/add_.glsl | 9 +- aten/src/ATen/native/vulkan/glsl/div.glsl | 13 +- aten/src/ATen/native/vulkan/glsl/div_.glsl | 9 +- aten/src/ATen/native/vulkan/glsl/mul.glsl | 13 +- aten/src/ATen/native/vulkan/glsl/mul_.glsl | 9 +- aten/src/ATen/native/vulkan/glsl/sub.glsl | 13 +- aten/src/ATen/native/vulkan/glsl/sub_.glsl | 9 +- .../src/ATen/native/vulkan/ops/Arithmetic.cpp | 149 +- aten/src/ATen/native/vulkan/ops/Batchnorm.cpp | 2 +- aten/src/ATen/native/vulkan/ops/Common.cpp | 36 - 
aten/src/ATen/native/vulkan/ops/Common.h | 52 +- aten/src/ATen/native/vulkan/ops/Concat.cpp | 4 +- .../ATen/native/vulkan/ops/Convolution.cpp | 1118 +- aten/src/ATen/native/vulkan/ops/Convolution.h | 150 +- aten/src/ATen/native/vulkan/ops/Copy.cpp | 193 +- aten/src/ATen/native/vulkan/ops/Copy.h | 32 +- aten/src/ATen/native/vulkan/ops/Glu.cpp | 2 +- aten/src/ATen/native/vulkan/ops/Gru.cpp | 328 +- aten/src/ATen/native/vulkan/ops/Gru.h | 128 +- aten/src/ATen/native/vulkan/ops/Lerp.cpp | 14 +- aten/src/ATen/native/vulkan/ops/Lstm.cpp | 264 +- aten/src/ATen/native/vulkan/ops/Lstm.h | 102 +- aten/src/ATen/native/vulkan/ops/Mm.cpp | 182 +- aten/src/ATen/native/vulkan/ops/Mm.h | 71 +- .../vulkan/ops/QuantizedConvolution.cpp | 648 - .../native/vulkan/ops/QuantizedConvolution.h | 44 - aten/src/ATen/native/vulkan/ops/Register.cpp | 306 +- aten/src/ATen/native/vulkan/ops/Shape.cpp | 2 +- aten/src/ATen/native/vulkan/ops/Slice.cpp | 8 +- aten/src/ATen/native/vulkan/ops/Tensor.h | 36 +- .../vulkan/ops/TransposeConvolution2d.cpp | 600 - .../vulkan/ops/TransposeConvolution2d.h | 125 - aten/src/ATen/native/vulkan/ops/Utils.cpp | 172 +- aten/src/ATen/native/vulkan/ops/Utils.h | 17 + .../native/vulkan/ops/VulkanOpContext.cpp | 34 - .../ATen/native/vulkan/ops/VulkanOpContext.h | 35 - .../native/vulkan/ops/VulkanPackedContext.h | 33 + aten/src/ATen/native/vulkan/ops/cumsum.cpp | 3 +- .../ATen/templates/DispatchKeyFunctions_inl.h | 5 - .../templates/RegisterDispatchDefinitions.ini | 24 + .../ATen/templates/RegisterDispatchKey.cpp | 27 +- aten/src/ATen/test/cpu_generator_test.cpp | 36 +- aten/src/ATen/test/cuda_generator_test.cu | 20 +- aten/src/ATen/test/vulkan_api_test.cpp | 523 +- .../ATen/test/vulkan_quantized_api_test.cpp | 2 + aten/src/ATen/test/xnnpack_test.cpp | 233 +- benchmarks/cpp/nvfuser/CMakeLists.txt | 5 +- .../cpp/nvfuser/batch_norm_channels_first.cpp | 4 - .../batch_norm_channels_first_backward.cpp | 4 - .../cpp/nvfuser/batch_norm_channels_last.cpp | 4 - .../batch_norm_channels_last_backward.cpp | 4 - benchmarks/cpp/nvfuser/bert.cpp | 24 +- benchmarks/cpp/nvfuser/broadcast.cpp | 10 +- benchmarks/cpp/nvfuser/gelu_backward.cpp | 9 +- benchmarks/cpp/nvfuser/heuristic_lookup.cpp | 14 +- benchmarks/cpp/nvfuser/instance_norm.cpp | 6 +- benchmarks/cpp/nvfuser/layer_norm.cpp | 8 +- .../cpp/nvfuser/layer_norm_backward.cpp | 9 +- benchmarks/cpp/nvfuser/lstm_cell.cpp | 4 +- benchmarks/cpp/nvfuser/matmul.cpp | 357 + benchmarks/cpp/nvfuser/reduction.cpp | 10 +- benchmarks/cpp/nvfuser/rms_norm.cpp | 2 - benchmarks/cpp/nvfuser/rms_norm_backward.cpp | 3 - benchmarks/cpp/nvfuser/scale_bias_relu.cpp | 18 +- benchmarks/cpp/nvfuser/shape_inference.cpp | 9 +- benchmarks/cpp/nvfuser/softmax.cpp | 6 +- benchmarks/cpp/nvfuser/softmax_backward.cpp | 34 +- benchmarks/cpp/nvfuser/softmax_dropout.cpp | 4 +- benchmarks/cpp/nvfuser/timm.cpp | 11 +- benchmarks/cpp/nvfuser/utils.cpp | 25 +- benchmarks/cpp/nvfuser/utils.h | 26 +- benchmarks/distributed/ddp/benchmark.py | 2 +- .../operator_benchmark/pt/qactivation_test.py | 14 +- .../operator_benchmark/pt/qarithmetic_test.py | 2 +- .../pt/qatembedding_ops_test.py | 2 +- benchmarks/operator_benchmark/pt/qcat_test.py | 2 +- .../operator_benchmark/pt/qconv_test.py | 2 +- .../pt/qembeddingbag_test.py | 2 +- .../operator_benchmark/pt/qlinear_test.py | 4 +- .../pt/quantization_test.py | 2 +- .../static_runtime/test_static_runtime.cc | 32 +- buckbuild.bzl | 63 +- build.bzl | 1 + build_variables.bzl | 25 +- c10/core/DispatchKeySet.cpp | 2 +- c10/core/SymInt.cpp | 109 +- c10/core/SymInt.h 
| 62 +- c10/core/SymIntArrayRef.h | 4 +- c10/core/SymIntNodeImpl.h | 5 +- c10/core/TensorImpl.cpp | 67 +- c10/core/TensorImpl.h | 63 +- c10/core/WrapDimMinimal.cpp | 7 +- c10/core/impl/GPUTrace.cpp | 22 + c10/core/impl/GPUTrace.h | 30 + c10/core/impl/PyInterpreter.cpp | 40 + c10/core/impl/PyInterpreter.h | 107 +- c10/core/impl/TorchDispatchModeTLS.cpp | 38 + c10/core/impl/TorchDispatchModeTLS.h | 20 + c10/cuda/CUDACachingAllocator.cpp | 112 +- c10/cuda/CUDACachingAllocator.h | 23 +- c10/cuda/CUDAStream.cpp | 9 + c10/cuda/impl/CUDAGuardImpl.h | 21 + c10/macros/Macros.h | 50 +- c10/test/core/SymInt_test.cpp | 3 +- c10/util/Exception.h | 4 + c10/util/IdWrapper.h | 1 + c10/util/SmallVector.cpp | 1 + c10/util/SmallVector.h | 1 + c10/util/hash.h | 8 + c10/util/logging_is_google_glog.h | 21 +- c10/util/strides.h | 14 +- c10/util/variant.h | 1 - caffe2/CMakeLists.txt | 98 +- caffe2/core/tensor.h | 8 + caffe2/quantization/server/dnnlowp.h | 2 + .../server/fully_connected_fake_lowp_op.h | 2 + caffe2/serialize/inline_container.cc | 8 +- caffe2/serialize/inline_container.h | 4 - caffe2/serialize/versions.h | 13 - caffe2/sgd/learning_rate_op.cc | 10 +- caffe2/utils/threadpool/ThreadPool.cc | 11 + cmake/Dependencies.cmake | 13 +- cmake/External/nccl.cmake | 44 +- cmake/External/ucc.cmake | 19 +- cmake/public/LoadHIP.cmake | 17 +- cmake/public/utils.cmake | 24 +- defs_gpu.bzl | 4 +- docker.Makefile | 47 +- docs/requirements.txt | 15 +- docs/source/amp.rst | 2 +- docs/source/backends.rst | 1 + docs/source/community/governance.rst | 29 +- docs/source/community/persons_of_interest.rst | 16 +- docs/source/conf.py | 53 +- docs/source/cuda.rst | 1 + docs/source/elastic/timer.rst | 11 + docs/source/index.rst | 1 + docs/source/masked.rst | 11 + docs/source/notes/cuda.rst | 2 +- docs/source/onnx.rst | 29 +- docs/source/optim.rst | 1 + docs/source/package.rst | 32 +- docs/source/quantization-support.rst | 47 +- docs/source/quantization.rst | 20 +- .../unittest/windows/scripts/environment.yml | 1 + .../codegen/gen_functorch_lagging_op_db.py | 58 - .../maml_omniglot/support/omniglot_loaders.py | 2 +- functorch/functorch/_src/aot_autograd.py | 531 +- functorch/functorch/_src/compile_utils.py | 10 + functorch/functorch/_src/compilers.py | 290 +- functorch/functorch/_src/fx_minifier.py | 383 +- functorch/functorch/_src/partitioners.py | 17 +- functorch/functorch/_src/python_key.py | 5 +- functorch/functorch/_src/vmap.py | 18 +- functorch/functorch/compile/__init__.py | 8 +- .../functorch/csrc/BatchRulesActivation.cpp | 5 + .../functorch/csrc/BatchRulesBinaryOps.cpp | 64 +- .../csrc/BatchRulesDecompositions.cpp | 22 + functorch/functorch/csrc/BatchRulesHelper.cpp | 44 + functorch/functorch/csrc/BatchRulesHelper.h | 3 + .../csrc/BatchRulesLinearAlgebra.cpp | 333 +- .../functorch/csrc/BatchRulesReduceOps.cpp | 5 - .../functorch/csrc/BatchedTensorImpl.cpp | 5 + functorch/functorch/csrc/BatchedTensorImpl.h | 5 + functorch/functorch/csrc/CompileCache.cpp | 2 +- functorch/functorch/csrc/Constants.h | 1 - functorch/functorch/csrc/CustomFunction.cpp | 3 +- functorch/functorch/csrc/DynamicLayer.cpp | 2 - functorch/functorch/csrc/Interpreter.h | 5 +- .../csrc/LegacyBatchingRegistrations.cpp | 110 +- .../functorch/csrc/PyTorchOperatorHacks.cpp | 90 - functorch/functorch/csrc/dim/arena.h | 328 + functorch/functorch/csrc/dim/dim.cpp | 3191 +++ functorch/functorch/csrc/dim/dim.h | 8 + functorch/functorch/csrc/dim/minpybind.h | 710 + .../csrc/dim/python_variable_simple.h | 49 + functorch/functorch/csrc/init.cpp | 25 +- 
functorch/functorch/dim/README.md | 759 + functorch/functorch/dim/__init__.py | 170 + functorch/functorch/dim/batch_tensor.py | 26 + functorch/functorch/dim/delayed_mul_tensor.py | 67 + functorch/functorch/dim/dim.py | 95 + functorch/functorch/dim/magic_trace.py | 34 + functorch/functorch/dim/op_properties.py | 282 + functorch/functorch/dim/reference.py | 557 + functorch/functorch/dim/tree_map.py | 12 + functorch/functorch/dim/wrap_type.py | 49 + functorch/functorch/experimental/cond.py | 137 + functorch/functorch/experimental/ops.py | 36 + .../colab/per_sample_grads_colab.ipynb | 4 +- functorch/notebooks/ensembling.ipynb | 2 +- functorch/notebooks/per_sample_grads.ipynb | 4 +- functorch/op_analysis/gen_data.py | 2 +- functorch/setup.py | 1 + functorch/test/attn_ft.py | 140 + functorch/test/attn_positional.py | 93 + functorch/test/common_utils.py | 158 +- functorch/test/discover_coverage.py | 4 - functorch/test/functorch_lagging_op_db.py | 574 - functorch/test/test_control_flow.py | 183 + functorch/test/test_dims.py | 594 + functorch/test/test_eager_transforms.py | 247 +- functorch/test/test_functionalize.py | 2 - functorch/test/test_minifier.py | 73 +- functorch/test/test_ops.py | 392 +- functorch/test/test_pythonkey.py | 192 +- functorch/test/test_vmap.py | 532 +- functorch/test/xfail_suggester.py | 3 + ios/TestApp/fastlane/Fastfile | 2 +- requirements.txt | 1 + scripts/onnx/test.sh | 1 + setup.py | 339 +- test/allowlist_for_publicAPI.json | 22 +- test/ao/sparsity/test_composability.py | 14 +- test/ao/sparsity/test_data_sparsifier.py | 99 +- test/cpp/api/CMakeLists.txt | 24 +- test/cpp/api/autograd.cpp | 139 +- test/cpp/api/dataloader.cpp | 2 - test/cpp/c10d/CMakeLists.txt | 13 + test/cpp/c10d/ProcessGroupGlooAsyncTest.cpp | 3 +- test/cpp/c10d/ProcessGroupNCCLTest.cpp | 8 +- test/cpp/c10d/ProcessGroupUCCTest.cpp | 35 + test/cpp/jit/CMakeLists.txt | 7 +- test/cpp/jit/test_flatbuffer.cpp | 227 +- test/cpp/jit/test_load_upgraders.cpp | 2 - test/cpp/jit/test_misc.cpp | 109 + test/cpp/lazy/test_ir.cpp | 10 +- test/cpp/lazy/test_lazy_ops.cpp | 2 + test/cpp/profiler/containers.cpp | 15 + test/cpp/tensorexpr/test_cuda.cpp | 819 +- .../_shard/checkpoint/test_checkpoint.py | 139 +- .../_shard/checkpoint/test_planner.py | 268 + .../_shard/checkpoint/test_utils.py | 4 +- .../sharded_tensor/ops/test_embedding.py | 18 +- .../sharded_tensor/ops/test_embedding_bag.py | 12 +- .../sharded_tensor/ops/test_tensor_ops.py | 8 + .../timer/file_based_local_timer_test.py | 266 + .../fsdp/test_checkpoint_wrapper.py | 8 +- test/distributed/fsdp/test_fsdp_comm_hooks.py | 193 +- test/distributed/fsdp/test_fsdp_misc.py | 151 +- .../distributed/fsdp/test_fsdp_optim_state.py | 245 +- test/distributed/fsdp/test_fsdp_state_dict.py | 20 +- test/distributed/fsdp/test_shard_utils.py | 26 +- test/distributed/fsdp/test_utils.py | 11 + test/distributed/test_c10d_nccl.py | 10 +- test/distributed/test_store.py | 2 +- test/distributions/test_distributions.py | 21 +- ..._compat-fx_backcompat_class_members.expect | 4 +- ...t-fx_backcompat_function_signatures.expect | 2 +- .../check_forward_backward_compatibility.py | 142 +- test/fx/quantization.py | 2 +- test/fx/test_pass_infra.py | 15 + test/fx/test_subgraph_rewriter.py | 167 +- test/fx/test_z3_gradual_types.py | 188 +- test/jit/test_backends.py | 11 +- test/jit/test_freezing.py | 117 +- test/jit/test_legacy_upgraders.py | 553 - test/jit/test_module_interface.py | 34 +- test/jit/test_tensor_creation_ops.py | 8 +- test/jit/test_upgraders.py | 13 - test/jit/test_with.py | 2 + 
test/mobile/test_lite_script_type.py | 2 +- .../test_quantize_fx_lite_script_module.py | 2 +- test/nn/test_packed_sequence.py | 392 + test/nn/test_pooling.py | 1429 + .../expect/TestOperators.test_acos.expect | 2 +- .../TestOperators.test_add_broadcast.expect | 2 +- ...stOperators.test_add_left_broadcast.expect | 2 +- ...tOperators.test_add_size1_broadcast.expect | 2 +- ...tors.test_add_size1_right_broadcast.expect | 2 +- ....test_add_size1_singleton_broadcast.expect | 2 +- .../TestOperators.test_addconstant.expect | 2 +- .../expect/TestOperators.test_addmm.expect | 2 +- .../expect/TestOperators.test_argmax.expect | 2 +- .../expect/TestOperators.test_asin.expect | 2 +- .../expect/TestOperators.test_at_op.expect | 8 +- .../expect/TestOperators.test_atan.expect | 2 +- .../TestOperators.test_avg_pool2d.expect | 2 +- .../expect/TestOperators.test_baddbmm.expect | 2 +- .../expect/TestOperators.test_basic.expect | 2 +- .../TestOperators.test_batchnorm.expect | 7 +- .../TestOperators.test_batchnorm_1d.expect | 7 +- ...stOperators.test_batchnorm_noaffine.expect | 7 +- ...tOperators.test_batchnorm_onnx_irv4.expect | 7 +- ...stOperators.test_batchnorm_training.expect | 9 +- .../expect/TestOperators.test_chunk.expect | 2 +- .../expect/TestOperators.test_clip.expect | 2 +- .../expect/TestOperators.test_clip_max.expect | 2 +- .../expect/TestOperators.test_clip_min.expect | 2 +- .../expect/TestOperators.test_concat2.expect | 2 +- .../expect/TestOperators.test_conv.expect | 2 +- .../TestOperators.test_conv_onnx_irv4.expect | 2 +- .../TestOperators.test_convtranspose.expect | 2 +- .../onnx/expect/TestOperators.test_cos.expect | 2 +- .../expect/TestOperators.test_dict.expect | 2 +- .../expect/TestOperators.test_dict_str.expect | 2 +- .../onnx/expect/TestOperators.test_dim.expect | 2 +- .../expect/TestOperators.test_dropout.expect | 2 +- .../TestOperators.test_dropout_default.expect | 2 +- ...TestOperators.test_dropout_training.expect | 2 +- .../onnx/expect/TestOperators.test_elu.expect | 2 +- .../TestOperators.test_embedding_bags.expect | 2 +- .../TestOperators.test_empty_like.expect | 2 +- .../expect/TestOperators.test_equal.expect | 2 +- .../onnx/expect/TestOperators.test_erf.expect | 2 +- .../onnx/expect/TestOperators.test_exp.expect | 2 +- .../expect/TestOperators.test_expand.expect | 2 +- .../expect/TestOperators.test_flatten.expect | 7 +- .../TestOperators.test_flatten2D.expect | 2 +- .../TestOperators.test_frobenius_norm.expect | 2 +- .../expect/TestOperators.test_full.expect | 2 +- .../TestOperators.test_full_like.expect | 2 +- .../expect/TestOperators.test_gather.expect | 2 +- test/onnx/expect/TestOperators.test_ge.expect | 2 +- .../expect/TestOperators.test_gelu.expect | 2 +- test/onnx/expect/TestOperators.test_gt.expect | 2 +- .../expect/TestOperators.test_hardtanh.expect | 2 +- .../TestOperators.test_implicit_expand.expect | 2 +- .../expect/TestOperators.test_index.expect | 2 +- .../expect/TestOperators.test_isnan.expect | 2 +- .../TestOperators.test_layer_norm_aten.expect | 2 +- test/onnx/expect/TestOperators.test_le.expect | 2 +- .../expect/TestOperators.test_linear.expect | 2 +- .../TestOperators.test_log_sigmoid.expect | 2 +- .../TestOperators.test_logsoftmax.expect | 2 +- test/onnx/expect/TestOperators.test_lt.expect | 2 +- .../onnx/expect/TestOperators.test_max.expect | 2 +- .../expect/TestOperators.test_maxpool.expect | 2 +- .../TestOperators.test_maxpool_indices.expect | 2 +- .../expect/TestOperators.test_mean.expect | 2 +- .../TestOperators.test_mean_dtype.expect | 2 +- 
.../expect/TestOperators.test_meshgrid.expect | 32 +- .../onnx/expect/TestOperators.test_min.expect | 2 +- test/onnx/expect/TestOperators.test_mm.expect | 2 +- .../expect/TestOperators.test_mul_bool.expect | 2 +- .../TestOperators.test_mul_fp_bool.expect | 2 +- .../expect/TestOperators.test_narrow.expect | 2 +- test/onnx/expect/TestOperators.test_ne.expect | 2 +- .../expect/TestOperators.test_nonzero.expect | 2 +- .../expect/TestOperators.test_norm_p1.expect | 2 +- .../expect/TestOperators.test_norm_p2.expect | 2 +- .../TestOperators.test_ones_like.expect | 2 +- .../onnx/expect/TestOperators.test_pad.expect | 12 +- .../expect/TestOperators.test_params.expect | 2 +- ...TestOperators.test_params_onnx_irv4.expect | 2 +- .../expect/TestOperators.test_permute2.expect | 2 +- .../onnx/expect/TestOperators.test_pow.expect | 2 +- .../expect/TestOperators.test_prelu.expect | 2 +- .../expect/TestOperators.test_prod.expect | 2 +- .../TestOperators.test_prod_dtype.expect | 2 +- .../expect/TestOperators.test_rand.expect | 2 +- .../expect/TestOperators.test_randn.expect | 2 +- ...rs.test_reduce_sum_negative_indices.expect | 2 +- .../TestOperators.test_reduced_mean.expect | 2 +- ...stOperators.test_reduced_mean_dtype.expect | 2 +- ...Operators.test_reduced_mean_keepdim.expect | 2 +- .../TestOperators.test_reduced_prod.expect | 2 +- ...stOperators.test_reduced_prod_dtype.expect | 2 +- ...Operators.test_reduced_prod_keepdim.expect | 2 +- .../TestOperators.test_reduced_sum.expect | 2 +- ...estOperators.test_reduced_sum_dtype.expect | 2 +- ...tOperators.test_reduced_sum_keepdim.expect | 2 +- .../TestOperators.test_reducemax.expect | 2 +- .../TestOperators.test_reducemin.expect | 2 +- .../TestOperators.test_remainder.expect | 2 +- .../expect/TestOperators.test_repeat.expect | 2 +- ...tOperators.test_repeat_dim_overflow.expect | 2 +- .../expect/TestOperators.test_rrelu.expect | 26 +- .../expect/TestOperators.test_rsqrt.expect | 2 +- .../expect/TestOperators.test_rsub.expect | 2 +- .../TestOperators.test_scatter_add.expect | 2 +- .../expect/TestOperators.test_selu.expect | 2 +- .../TestOperators.test_shape_value_map.expect | 12 +- .../expect/TestOperators.test_sign.expect | 2 +- .../onnx/expect/TestOperators.test_sin.expect | 2 +- .../expect/TestOperators.test_slice.expect | 2 +- .../expect/TestOperators.test_split.expect | 2 +- ...TestOperators.test_split_with_sizes.expect | 2 +- .../expect/TestOperators.test_sqrt.expect | 2 +- .../onnx/expect/TestOperators.test_std.expect | 2 +- .../onnx/expect/TestOperators.test_sum.expect | 2 +- .../TestOperators.test_sum_dtype.expect | 2 +- .../onnx/expect/TestOperators.test_tan.expect | 2 +- .../TestOperators.test_transpose.expect | 2 +- .../expect/TestOperators.test_type_as.expect | 2 +- .../expect/TestOperators.test_unfold.expect | 2 +- .../TestOperators.test_unsqueeze.expect | 2 +- ...erators.test_upsample_nearest_scale.expect | 2 +- ..._nearest_scale_default_scale_factor.expect | 2 +- ...perators.test_upsample_nearest_size.expect | 2 +- .../expect/TestOperators.test_view.expect | 7 +- .../TestOperators.test_view_flatten.expect | 7 +- .../TestOperators.test_zeros_like.expect | 2 +- test/onnx/internal/test_beartype.py | 86 + test/onnx/onnx_test_common.py | 6 + test/onnx/pytorch_test_common.py | 18 + test/onnx/test_autograd_funs.py | 212 + test/onnx/test_models.py | 12 +- test/onnx/test_models_onnxruntime.py | 28 +- test/onnx/test_onnx_opset.py | 2 +- test/onnx/test_pytorch_jit_onnx.py | 13 +- .../test_pytorch_onnx_caffe2_quantized.py | 4 +- 
test/onnx/test_pytorch_onnx_no_runtime.py | 23 +- test/onnx/test_pytorch_onnx_onnxruntime.py | 184 +- test/onnx/test_utility_funs.py | 124 +- test/onnx/test_verification.py | 33 + test/quantization/ao_migration/common.py | 22 +- .../ao_migration/test_ao_migration.py | 350 + .../bc/test_backward_compatibility.py | 4 +- .../experimental/apot_fx_graph_mode_ptq.py | 131 + .../experimental/apot_fx_graph_mode_qat.py | 94 + ...raph_mode_apot.py => quantization_util.py} | 152 +- test/quantization/core/test_docs.py | 39 +- .../core/test_quantized_functional.py | 2 +- .../core/test_quantized_module.py | 88 +- test/quantization/core/test_quantized_op.py | 147 +- test/quantization/core/test_utils.py | 65 + .../quantization/core/test_workflow_module.py | 8 +- test/quantization/dbr/test_quantize_dbr.py | 1619 -- test/quantization/eager/test_fuse_eager.py | 2 +- .../eager/test_numeric_suite_eager.py | 7 +- .../eager/test_quantize_eager_ptq.py | 26 +- .../eager/test_quantize_eager_qat.py | 12 +- test/quantization/fx/test_equalize_fx.py | 2 +- test/quantization/fx/test_model_report_fx.py | 114 +- test/quantization/fx/test_numeric_suite_fx.py | 6 +- test/quantization/fx/test_quantize_fx.py | 209 +- test/run_doctests.sh | 29 + test/run_test.py | 93 +- test/test_ao_sparsity.py | 8 +- test/test_autograd.py | 440 +- test/test_binary_ufuncs.py | 21 +- test/test_cpp_api_parity.py | 59 +- test/test_cpp_extensions_jit.py | 3 +- ...cpp_extensions_open_device_registration.py | 4 +- test/test_cuda.py | 44 +- test/test_cuda_trace.py | 96 + test/test_dataloader.py | 69 +- test/test_datapipe.py | 277 +- test/test_decomp.py | 13 +- test/test_dlpack.py | 193 + test/test_dynamic_shapes.py | 45 +- test/test_dynamo_cudagraphs.py | 192 - test/test_expanded_weights.py | 9 +- test/test_fake_tensor.py | 28 +- test/test_foreach.py | 14 +- test/test_function_schema.py | 6 + test/test_functionalization.py | 118 +- test/test_fx.py | 86 +- test/test_fx_passes.py | 371 +- test/test_fx_reinplace_pass.py | 112 +- test/test_jit.py | 15 +- test/test_jit_autocast.py | 98 +- test/test_jit_cuda_fuser.py | 154 +- test/test_jit_fuser_te.py | 37 + test/test_jiterator.py | 10 +- test/test_linalg.py | 98 +- test/test_maskedtensor.py | 245 + test/test_meta.py | 14 + test/test_mkldnn_fusion.py | 118 + test/test_module_init.py | 87 +- test/test_mps.py | 340 +- test/test_multiprocessing.py | 5 +- test/test_namedtensor.py | 12 +- test/test_namedtuple_return_api.py | 40 +- test/test_native_mha.py | 1 + test/test_nestedtensor.py | 313 +- test/test_nn.py | 2131 +- test/test_nnapi.py | 15 +- test/test_ops.py | 51 +- test/test_ops_jit.py | 191 +- test/test_optim.py | 148 +- test/test_overrides.py | 46 + test/test_per_overload_api.py | 8 + test/test_prims.py | 119 +- test/test_profiler.py | 338 +- test/test_profiler_tree.py | 185 +- test/test_proxy_tensor.py | 504 +- test/test_public_bindings.py | 12 +- test/test_python_dispatch.py | 49 +- test/test_pytree.py | 42 +- test/test_quantization.py | 16 +- test/test_reductions.py | 6 +- test/{jit => }/test_schema_check.py | 36 +- test/test_sort_and_select.py | 3 +- test/test_sparse.py | 123 +- test/test_sparse_csr.py | 168 +- test/test_spectral_ops.py | 2 +- test/test_tensor_creation_ops.py | 20 +- test/test_testing.py | 6 +- test/test_torch.py | 210 +- test/test_transformers.py | 66 +- test/test_type_promotion.py | 73 +- test/test_unary_ufuncs.py | 22 +- test/test_utils.py | 59 + third_party/VulkanMemoryAllocator | 1 + third_party/cpuinfo | 2 +- third_party/cpuinfo.BUILD | 55 - third_party/ideep | 2 +- 
third_party/nccl/nccl | 2 +- tools/autograd/context.py | 12 + tools/autograd/derivatives.yaml | 111 +- tools/autograd/gen_autograd.py | 12 +- tools/autograd/gen_autograd_functions.py | 44 +- tools/autograd/gen_python_functions.py | 5 - tools/autograd/gen_trace_type.py | 15 +- tools/autograd/gen_variable_type.py | 130 +- tools/autograd/load_derivatives.py | 175 +- tools/autograd/templates/VariableType.cpp | 7 +- tools/autograd/templates/python_enum_tag.cpp | 3 +- .../templates/python_variable_methods.cpp | 21 +- .../package/tool/summarize_jsons.py | 2 +- tools/onnx/update_default_opset_version.py | 150 +- tools/setup_helpers/cmake_utils.py | 2 +- tools/stats/import_test_stats.py | 39 +- tools/stats/print_test_stats.py | 6 +- tools/stats/upload_test_stats.py | 37 +- tools/target_definitions.bzl | 33 +- tools/test/test_codegen.py | 202 +- tools/test/test_codegen_model.py | 64 +- tools/test/test_selective_build.py | 21 + tools/testing/test_selections.py | 2 +- torch/CMakeLists.txt | 25 +- torch/_C/__init__.pyi.in | 25 +- torch/_C/_autograd.pyi | 91 +- torch/_C/_distributed_rpc.pyi | 5 +- torch/_C/_profiler.pyi | 111 + torch/__init__.py | 5 +- torch/_decomp/__init__.py | 14 +- torch/_decomp/decompositions.py | 264 +- .../dbr => torch/_dispatch}/__init__.py | 0 torch/_dispatch/_dispatcher.py | 50 + torch/_lazy/__init__.py | 13 + torch/_lazy/extract_compiled_graph.py | 2 +- torch/_masked/__init__.py | 2 +- torch/_meta_registrations.py | 33 +- torch/_namedtensor_internals.py | 2 + torch/_ops.py | 19 +- torch/_prims/__init__.py | 310 +- torch/_prims/context.py | 91 +- torch/_prims/executor.py | 19 +- torch/_prims/nvfuser_executor.py | 22 +- torch/_prims/nvfuser_prims.py | 280 + torch/_prims_common/__init__.py | 79 +- torch/_prims_common/wrappers.py | 49 +- torch/_refs/__init__.py | 720 +- torch/_refs/nn/functional/__init__.py | 53 + torch/_subclasses/fake_tensor.py | 150 +- torch/_subclasses/meta_utils.py | 38 +- torch/_tensor.py | 8 +- torch/_tensor_str.py | 15 +- torch/_torch_docs.py | 11 +- torch/ao/nn/__init__.py | 18 +- torch/ao/nn/qat/__init__.py | 1 + torch/ao/nn/qat/dynamic/__init__.py | 1 + torch/ao/nn/qat/dynamic/modules/__init__.py | 3 + torch/ao/nn/qat/dynamic/modules/linear.py | 25 + torch/ao/nn/qat/modules/__init__.py | 14 + torch/ao/nn/qat/modules/conv.py | 264 + torch/ao/nn/qat/modules/embedding_ops.py | 143 + torch/ao/nn/qat/modules/linear.py | 77 + torch/ao/nn/quantizable/__init__.py | 1 + torch/ao/nn/quantizable/modules/__init__.py | 9 + torch/ao/nn/quantizable/modules/activation.py | 454 + torch/ao/nn/quantizable/modules/rnn.py | 386 + torch/ao/nn/quantized/__init__.py | 38 + torch/ao/nn/quantized/_reference/__init__.py | 1 + .../quantized/_reference/modules/__init__.py | 20 + .../nn/quantized/_reference/modules/conv.py | 316 + .../nn/quantized/_reference/modules/linear.py | 55 + .../ao/nn/quantized/_reference/modules/rnn.py | 471 + .../nn/quantized/_reference/modules/sparse.py | 92 + .../nn/quantized/_reference/modules/utils.py | 154 + torch/ao/nn/quantized/dynamic/__init__.py | 1 + .../nn/quantized/dynamic/modules/__init__.py | 19 + torch/ao/nn/quantized/dynamic/modules/conv.py | 399 + .../ao/nn/quantized/dynamic/modules/linear.py | 127 + torch/ao/nn/quantized/dynamic/modules/rnn.py | 1054 + torch/ao/nn/quantized/functional.py | 616 + torch/ao/nn/quantized/modules/__init__.py | 136 + torch/ao/nn/quantized/modules/activation.py | 278 + torch/ao/nn/quantized/modules/batchnorm.py | 101 + torch/ao/nn/quantized/modules/conv.py | 937 + torch/ao/nn/quantized/modules/dropout.py | 27 + 
.../ao/nn/quantized/modules/embedding_ops.py | 295 + .../quantized/modules/functional_modules.py | 233 + torch/ao/nn/quantized/modules/linear.py | 302 + .../ao/nn/quantized/modules/normalization.py | 204 + torch/ao/nn/quantized/modules/rnn.py | 47 + torch/ao/nn/quantized/modules/utils.py | 113 + .../ao/nn/sparse/quantized/dynamic/linear.py | 8 +- torch/ao/nn/sparse/quantized/linear.py | 2 +- torch/ao/ns/_numeric_suite.py | 4 +- torch/ao/ns/_numeric_suite_dbr.py | 112 - torch/ao/ns/fx/mappings.py | 8 +- torch/ao/ns/fx/utils.py | 2 +- torch/ao/ns/fx/weight_utils.py | 6 +- torch/ao/quantization/_correct_bias.py | 2 +- torch/ao/quantization/_dbr/README.md | 259 - torch/ao/quantization/_dbr/auto_trace.py | 723 - .../quantization/_dbr/auto_trace_rewriter.py | 247 - torch/ao/quantization/_dbr/function_fusion.py | 101 - torch/ao/quantization/_dbr/fusion.py | 56 - torch/ao/quantization/_dbr/mappings.py | 178 - torch/ao/quantization/_dbr/model_utils.py | 163 - .../ao/quantization/_dbr/module_swap_utils.py | 79 - .../_dbr/qconfig_mapping_utils.py | 25 - .../quantization/_dbr/quantization_state.py | 986 - .../ao/quantization/_dbr/torchscript_utils.py | 15 - torch/ao/quantization/_dbr/utils.py | 751 - torch/ao/quantization/_quantize_dbr.py | 144 - .../quantization/backend_config/__init__.py | 18 +- .../_common_operator_config_utils.py | 565 +- .../backend_config/backend_config.py | 60 +- .../ao/quantization/backend_config/fbgemm.py | 114 + .../ao/quantization/backend_config/native.py | 372 +- .../backend_config/observation_type.py | 13 - .../ao/quantization/backend_config/qnnpack.py | 114 + .../quantization/backend_config/tensorrt.py | 93 +- torch/ao/quantization/backend_config/utils.py | 113 +- torch/ao/quantization/experimental/linear.py | 6 - .../ao/quantization/experimental/observer.py | 18 +- torch/ao/quantization/fuse_modules.py | 1 + .../ao/quantization/fuser_method_mappings.py | 6 +- torch/ao/quantization/fx/_equalize.py | 47 +- .../fx/_lower_to_native_backend.py | 60 +- .../quantization/fx/_model_report/README.md | 88 +- .../quantization/fx/_model_report/detector.py | 212 +- .../fx/_model_report/model_report.py | 197 +- .../fx/_model_report/model_report_observer.py | 2 +- .../_model_report/model_report_visualizer.py | 116 +- .../quantization/fx/backend_config_utils.py | 50 +- .../fx/common_quantization_patterns.py | 8 - torch/ao/quantization/fx/convert.py | 48 +- torch/ao/quantization/fx/custom_config.py | 43 +- torch/ao/quantization/fx/fuse.py | 35 +- torch/ao/quantization/fx/fusion_patterns.py | 2 +- torch/ao/quantization/fx/pattern_utils.py | 4 +- torch/ao/quantization/fx/prepare.py | 194 +- ...nfig_utils.py => qconfig_mapping_utils.py} | 20 +- .../quantization/fx/quantization_patterns.py | 9 +- torch/ao/quantization/fx/tracer.py | 2 +- torch/ao/quantization/observer.py | 34 +- .../ao/quantization/quantization_mappings.py | 13 +- torch/ao/quantization/quantize.py | 2 +- torch/ao/quantization/quantize_fx.py | 251 +- .../activation_sparsifier.py | 50 +- .../data_scheduler/base_data_scheduler.py | 10 +- .../data_sparsifier/base_data_sparsifier.py | 2 +- .../data_sparsifier/quantization_utils.py | 130 + torch/ao/sparsity/_mappings.py | 5 +- .../ao/sparsity/scheduler/lambda_scheduler.py | 1 + .../ao/sparsity/sparsifier/base_sparsifier.py | 8 +- torch/ao/sparsity/sparsifier/utils.py | 2 +- torch/autograd/__init__.py | 34 +- torch/autograd/anomaly_mode.py | 20 +- torch/autograd/forward_ad.py | 3 + torch/autograd/function.py | 3 + torch/autograd/functional.py | 6 + torch/autograd/grad_mode.py | 12 
+- torch/autograd/graph.py | 10 +- torch/autograd/profiler.py | 8 +- torch/autograd/profiler_legacy.py | 2 +- torch/backends/xeon/run_cpu.py | 16 +- torch/csrc/DynamicTypes.cpp | 2 + torch/csrc/Exceptions.cpp | 14 +- torch/csrc/Exceptions.h | 81 +- torch/csrc/Module.cpp | 2 + torch/csrc/Storage.cpp | 2 + torch/csrc/api/include/torch/nn/pimpl.h | 10 +- torch/csrc/autograd/FunctionsManual.cpp | 143 +- torch/csrc/autograd/FunctionsManual.h | 9 +- torch/csrc/autograd/TraceTypeManual.cpp | 4 +- torch/csrc/autograd/anomaly_mode.cpp | 8 +- torch/csrc/autograd/anomaly_mode.h | 12 +- torch/csrc/autograd/autograd.cpp | 6 + .../autograd_not_implemented_fallback.cpp | 3 +- torch/csrc/autograd/custom_function.cpp | 13 + torch/csrc/autograd/custom_function.h | 5 + torch/csrc/autograd/engine.cpp | 23 +- torch/csrc/autograd/engine.h | 170 +- torch/csrc/autograd/function.h | 44 + torch/csrc/autograd/functions/tensor.cpp | 19 +- torch/csrc/autograd/graph_task.h | 193 + torch/csrc/autograd/init.cpp | 234 +- torch/csrc/autograd/input_buffer.cpp | 1 + torch/csrc/autograd/input_buffer.h | 1 - torch/csrc/autograd/profiler_kineto.cpp | 457 +- torch/csrc/autograd/profiler_kineto.h | 294 +- torch/csrc/autograd/profiler_legacy.cpp | 21 +- torch/csrc/autograd/profiler_python.cpp | 72 +- torch/csrc/autograd/python_anomaly_mode.cpp | 2 +- torch/csrc/autograd/python_cpp_function.cpp | 39 +- torch/csrc/autograd/python_cpp_function.h | 8 + torch/csrc/autograd/python_function.cpp | 94 +- torch/csrc/autograd/python_hook.cpp | 140 +- torch/csrc/autograd/python_hook.h | 11 +- .../python_torch_functions_manual.cpp | 363 - torch/csrc/autograd/python_variable.cpp | 143 +- torch/csrc/autograd/python_variable.h | 2 + .../autograd/python_variable_indexing.cpp | 5 +- torch/csrc/autograd/saved_variable.cpp | 7 +- torch/csrc/autograd/variable.cpp | 9 +- torch/csrc/cuda/Module.cpp | 122 +- torch/csrc/cuda/shared/cudart.cpp | 22 +- torch/csrc/deploy/deploy.cpp | 4 + torch/csrc/deploy/interpreter/defs.bzl | 8 +- torch/csrc/distributed/c10d/Ops.cpp | 6 +- torch/csrc/distributed/c10d/ProcessGroup.hpp | 9 + .../distributed/c10d/ProcessGroupGloo.cpp | 4 - .../distributed/c10d/ProcessGroupNCCL.cpp | 411 +- .../distributed/c10d/ProcessGroupNCCL.hpp | 49 + .../csrc/distributed/c10d/ProcessGroupUCC.cpp | 84 +- torch/csrc/distributed/c10d/UCCUtils.cpp | 5 + torch/csrc/distributed/c10d/UCCUtils.hpp | 33 + torch/csrc/distributed/c10d/debug.h | 2 +- torch/csrc/distributed/c10d/init.cpp | 9 + torch/csrc/init_flatbuffer_module.cpp | 21 +- .../backends/coreml/objc/PTMCoreMLBackend.mm | 63 +- .../backends/coreml/objc/PTMCoreMLCompiler.h | 12 +- .../backends/coreml/objc/PTMCoreMLCompiler.mm | 143 +- .../coreml/objc/PTMCoreMLModelWrapper.h | 9 - .../coreml/observer/PTMCoreMLObserver.h | 47 - .../coreml/observer/PTMCoreMLObserver.mm | 8 - torch/csrc/jit/codegen/cuda/arith.cpp | 197 +- torch/csrc/jit/codegen/cuda/arith.h | 49 +- torch/csrc/jit/codegen/cuda/codegen.cpp | 275 +- torch/csrc/jit/codegen/cuda/compute_at.cpp | 21 +- torch/csrc/jit/codegen/cuda/compute_at.h | 2 +- .../csrc/jit/codegen/cuda/compute_at_map.cpp | 360 +- torch/csrc/jit/codegen/cuda/compute_at_map.h | 23 +- torch/csrc/jit/codegen/cuda/contiguity.cpp | 617 +- torch/csrc/jit/codegen/cuda/contiguity.h | 152 +- torch/csrc/jit/codegen/cuda/disjoint_set.h | 11 +- torch/csrc/jit/codegen/cuda/dispatch.cpp | 90 + torch/csrc/jit/codegen/cuda/dispatch.h | 24 + torch/csrc/jit/codegen/cuda/dynamic_type.h | 67 +- .../jit/codegen/cuda/evaluator_common.cpp | 234 +- 
.../csrc/jit/codegen/cuda/evaluator_common.h | 102 +- torch/csrc/jit/codegen/cuda/executor.cpp | 392 +- torch/csrc/jit/codegen/cuda/executor.h | 51 +- .../jit/codegen/cuda/executor_kernel_arg.cpp | 35 + .../jit/codegen/cuda/executor_kernel_arg.h | 258 +- .../csrc/jit/codegen/cuda/executor_utils.cpp | 423 +- torch/csrc/jit/codegen/cuda/executor_utils.h | 9 +- .../csrc/jit/codegen/cuda/expr_evaluator.cpp | 19 +- torch/csrc/jit/codegen/cuda/expr_evaluator.h | 13 +- torch/csrc/jit/codegen/cuda/fusion.cpp | 28 +- torch/csrc/jit/codegen/cuda/fusion.h | 8 +- .../jit/codegen/cuda/fusion_segmenter.cpp | 60 +- .../csrc/jit/codegen/cuda/fusion_segmenter.h | 14 +- torch/csrc/jit/codegen/cuda/graph_fuser.cpp | 12 +- .../jit/codegen/cuda/grouped_reduction.cpp | 18 +- .../csrc/jit/codegen/cuda/grouped_reduction.h | 4 + torch/csrc/jit/codegen/cuda/index_compute.cpp | 297 +- torch/csrc/jit/codegen/cuda/index_compute.h | 113 +- .../jit/codegen/cuda/inline_propagator.cpp | 385 - .../csrc/jit/codegen/cuda/inline_propagator.h | 118 - torch/csrc/jit/codegen/cuda/inlining.cpp | 306 + torch/csrc/jit/codegen/cuda/inlining.h | 100 + torch/csrc/jit/codegen/cuda/interface.cpp | 56 + torch/csrc/jit/codegen/cuda/ir_base_nodes.cpp | 4 +- torch/csrc/jit/codegen/cuda/ir_base_nodes.h | 1 + torch/csrc/jit/codegen/cuda/ir_builder.cpp | 4 + torch/csrc/jit/codegen/cuda/ir_cloner.cpp | 16 + torch/csrc/jit/codegen/cuda/ir_cloner.h | 4 + torch/csrc/jit/codegen/cuda/ir_graphviz.cpp | 38 + torch/csrc/jit/codegen/cuda/ir_graphviz.h | 4 + .../jit/codegen/cuda/ir_interface_nodes.h | 29 +- .../csrc/jit/codegen/cuda/ir_internal_nodes.h | 498 +- torch/csrc/jit/codegen/cuda/ir_iostream.cpp | 355 +- torch/csrc/jit/codegen/cuda/ir_iostream.h | 6 + torch/csrc/jit/codegen/cuda/ir_nodes.cpp | 670 +- torch/csrc/jit/codegen/cuda/ir_utils.cpp | 144 +- torch/csrc/jit/codegen/cuda/ir_utils.h | 9 + torch/csrc/jit/codegen/cuda/iter_visitor.cpp | 149 +- torch/csrc/jit/codegen/cuda/iter_visitor.h | 99 +- torch/csrc/jit/codegen/cuda/kernel.cpp | 17 +- torch/csrc/jit/codegen/cuda/kernel_cache.cpp | 379 +- torch/csrc/jit/codegen/cuda/kernel_cache.h | 90 +- .../codegen/cuda/kernel_expr_evaluator.cpp | 79 +- .../jit/codegen/cuda/kernel_expr_evaluator.h | 14 +- torch/csrc/jit/codegen/cuda/kernel_ir.cpp | 46 + torch/csrc/jit/codegen/cuda/kernel_ir.h | 68 +- torch/csrc/jit/codegen/cuda/lower2device.cpp | 19 +- torch/csrc/jit/codegen/cuda/lower2device.h | 27 +- .../jit/codegen/cuda/lower_alias_memory.cpp | 22 +- .../jit/codegen/cuda/lower_allocation.cpp | 15 +- .../codegen/cuda/lower_divisible_split.cpp | 121 + .../jit/codegen/cuda/lower_divisible_split.h | 29 + .../jit/codegen/cuda/lower_double_buffer.cpp | 3 +- .../csrc/jit/codegen/cuda/lower_expr_sort.cpp | 2 +- .../codegen/cuda/lower_fused_reduction.cpp | 12 +- torch/csrc/jit/codegen/cuda/lower_index.cpp | 263 +- torch/csrc/jit/codegen/cuda/lower_index.h | 23 +- .../jit/codegen/cuda/lower_index_compute.cpp | 191 +- .../jit/codegen/cuda/lower_index_compute.h | 12 + .../jit/codegen/cuda/lower_insert_syncs.cpp | 4 +- torch/csrc/jit/codegen/cuda/lower_loops.cpp | 4 +- .../csrc/jit/codegen/cuda/lower_predicate.cpp | 11 +- .../cuda/lower_predicate_elimination.cpp | 75 +- torch/csrc/jit/codegen/cuda/lower_shift.cpp | 125 +- torch/csrc/jit/codegen/cuda/lower_shift.h | 33 +- .../codegen/cuda/lower_sync_information.cpp | 20 +- .../codegen/cuda/lower_thread_predicate.cpp | 7 +- .../codegen/cuda/lower_trivial_broadcast.cpp | 6 +- .../codegen/cuda/lower_trivial_broadcast.h | 3 +- 
torch/csrc/jit/codegen/cuda/lower_unroll.cpp | 10 +- torch/csrc/jit/codegen/cuda/lower_utils.cpp | 372 +- torch/csrc/jit/codegen/cuda/lower_utils.h | 112 +- .../jit/codegen/cuda/lower_validation.cpp | 45 +- .../jit/codegen/cuda/lower_warp_reduce.cpp | 6 +- torch/csrc/jit/codegen/cuda/manager.cpp | 12 +- torch/csrc/jit/codegen/cuda/mutator.cpp | 123 +- .../jit/codegen/cuda/non_divisible_split.cpp | 13 +- torch/csrc/jit/codegen/cuda/nvfuser.cmake | 4 +- torch/csrc/jit/codegen/cuda/ops/alias.cpp | 39 +- torch/csrc/jit/codegen/cuda/ops/composite.cpp | 22 +- torch/csrc/jit/codegen/cuda/ops/composite.h | 2 + .../jit/codegen/cuda/ops/normalization.cpp | 58 + .../csrc/jit/codegen/cuda/ops/normalization.h | 11 + .../codegen/cuda/parallel_dimension_map.cpp | 2 +- .../jit/codegen/cuda/parallel_type_bitmap.cpp | 2 + torch/csrc/jit/codegen/cuda/parser.cpp | 169 +- .../cuda/python_frontend/python_bindings.cpp | 9 +- .../csrc/jit/codegen/cuda/reference_tensor.h | 27 - .../csrc/jit/codegen/cuda/root_domain_map.cpp | 197 +- torch/csrc/jit/codegen/cuda/root_domain_map.h | 17 +- .../codegen/cuda/runtime/fused_reduction.cu | 1855 +- .../cuda/runtime/fused_welford_helper.cu | 93 + .../cuda/runtime/fused_welford_impl.cu | 623 + .../csrc/jit/codegen/cuda/runtime/helpers.cu | 19 + .../codegen/cuda/runtime/random_numbers.cu | 34 +- torch/csrc/jit/codegen/cuda/runtime/tuple.cu | 173 + .../codegen/cuda/scheduler/all_schedulers.h | 5 +- .../cuda/scheduler/compile_time_info.h | 56 +- .../jit/codegen/cuda/scheduler/heuristic.h | 3 +- .../jit/codegen/cuda/scheduler/mma_utils.cpp | 8 +- .../codegen/cuda/scheduler/normalization.cpp | 8 +- .../jit/codegen/cuda/scheduler/pointwise.cpp | 277 +- .../jit/codegen/cuda/scheduler/pointwise.h | 138 + .../cuda/scheduler/pointwise_utils.cpp | 46 +- .../codegen/cuda/scheduler/pointwise_utils.h | 20 +- .../jit/codegen/cuda/scheduler/reduction.cpp | 8 +- .../cuda/scheduler/reduction_utils.cpp | 13 +- .../jit/codegen/cuda/scheduler/registry.cpp | 453 +- .../jit/codegen/cuda/scheduler/registry.h | 26 +- .../jit/codegen/cuda/scheduler/transpose.cpp | 1140 + .../jit/codegen/cuda/scheduler/transpose.h | 115 + .../cuda/scheduler/transpose_heuristic.h | 163 + .../csrc/jit/codegen/cuda/scheduler/utils.cpp | 965 +- torch/csrc/jit/codegen/cuda/scheduler/utils.h | 178 +- .../cuda/scheduler/vectorize_helper.cpp | 287 + .../codegen/cuda/scheduler/vectorize_helper.h | 14 +- torch/csrc/jit/codegen/cuda/tensor_view.cpp | 139 +- torch/csrc/jit/codegen/cuda/test/test_gpu.cpp | 967 +- .../cuda/test/test_gpu_fused_reduction.cpp | 312 + .../jit/codegen/cuda/test/test_gpu_rng.cu | 211 +- .../cuda/test/test_gpu_tensor_factories.cpp | 339 + .../codegen/cuda/test/test_gpu_transpose.cpp | 1028 + .../jit/codegen/cuda/test/test_gpu_utils.cpp | 273 + .../codegen/cuda/test/test_gpu_validator.h | 59 +- .../jit/codegen/cuda/test/test_gpu_view.cpp | 699 +- torch/csrc/jit/codegen/cuda/test/test_utils.h | 71 +- .../jit/codegen/cuda/tools/stringify_file.py | 10 +- .../csrc/jit/codegen/cuda/transform_iter.cpp | 6 +- torch/csrc/jit/codegen/cuda/type.cpp | 64 +- torch/csrc/jit/codegen/cuda/type.h | 17 +- .../csrc/jit/codegen/cuda/type_inference.cpp | 11 +- torch/csrc/jit/codegen/cuda/utils.cpp | 93 +- torch/csrc/jit/codegen/cuda/utils.h | 13 +- torch/csrc/jit/codegen/fuser/codegen.cpp | 2 +- .../jit/codegen/fuser/cuda/fused_kernel.cpp | 5 + torch/csrc/jit/frontend/builtin_functions.cpp | 100 - .../jit/frontend/function_schema_parser.cpp | 6 +- torch/csrc/jit/frontend/ir_emitter.cpp | 4 +- 
torch/csrc/jit/frontend/schema_matching.cpp | 4 - .../csrc/jit/frontend/schema_type_parser.cpp | 6 +- torch/csrc/jit/frontend/source_range.cpp | 3 + torch/csrc/jit/frontend/sugared_value.h | 8 - torch/csrc/jit/frontend/tracer.cpp | 7 +- torch/csrc/jit/ir/alias_analysis.cpp | 15 +- torch/csrc/jit/ir/ir.cpp | 1 + torch/csrc/jit/mobile/flatbuffer_loader.cpp | 209 +- torch/csrc/jit/mobile/flatbuffer_loader.h | 194 +- torch/csrc/jit/mobile/import.cpp | 53 +- .../jit/mobile/model_tracer/TracerRunner.cpp | 14 +- torch/csrc/jit/mobile/promoted_prim_ops.cpp | 15 + torch/csrc/jit/mobile/promoted_prim_ops.h | 6 + .../operator_upgraders/upgraders_guard.cpp | 12 - .../jit/operator_upgraders/upgraders_guard.h | 10 - torch/csrc/jit/passes/autocast.cpp | 41 + .../frozen_conv_add_relu_fusion_cuda.cpp | 16 +- .../jit/passes/hoist_conv_packed_params.cpp | 15 +- torch/csrc/jit/passes/mkldnn_rewrite.cpp | 234 + torch/csrc/jit/passes/mkldnn_rewrite.h | 34 + torch/csrc/jit/passes/normalize_ops.cpp | 1 + torch/csrc/jit/passes/onnx.cpp | 38 +- .../jit/passes/onnx/function_extraction.cpp | 8 +- .../jit/passes/onnx/function_substitution.cpp | 107 + torch/csrc/jit/passes/onnx/naming.cpp | 205 + torch/csrc/jit/passes/onnx/naming.h | 30 + .../autograd_function_process.cpp | 58 + .../autograd_function_process.h | 11 + .../jit/passes/onnx/shape_type_inference.cpp | 34 +- torch/csrc/jit/passes/quantization/helper.h | 4 +- .../jit/passes/symbolic_shape_analysis.cpp | 53 +- torch/csrc/jit/passes/tensorexpr_fuser.cpp | 37 +- torch/csrc/jit/passes/utils/memory_dag.cpp | 2 +- torch/csrc/jit/passes/vulkan_rewrite.cpp | 58 +- torch/csrc/jit/python/init.cpp | 123 +- torch/csrc/jit/python/pybind_utils.cpp | 8 +- torch/csrc/jit/python/pybind_utils.h | 17 +- torch/csrc/jit/python/python_ir.cpp | 12 +- torch/csrc/jit/python/script_init.cpp | 3 - torch/csrc/jit/runtime/graph_executor.cpp | 2 +- torch/csrc/jit/runtime/operator.h | 16 + torch/csrc/jit/runtime/register_prim_ops.cpp | 36 + .../serialized_shape_function_registry.cpp | 127 +- torch/csrc/jit/runtime/static/native_ops.cpp | 30 +- torch/csrc/jit/runtime/static/ops.cpp | 53 + torch/csrc/jit/runtime/static/passes.cpp | 10 +- torch/csrc/jit/runtime/symbolic_script.cpp | 15 +- .../jit/runtime/symbolic_shape_registry.cpp | 49 +- .../runtime/symbolic_shape_registry_util.cpp | 1 + torch/csrc/jit/serialization/export.cpp | 51 +- .../serialization/flatbuffer_serializer.cpp | 56 +- .../jit/serialization/flatbuffer_serializer.h | 68 +- .../flatbuffer_serializer_jit.cpp | 31 +- .../serialization/flatbuffer_serializer_jit.h | 5 +- torch/csrc/jit/serialization/import.cpp | 3 +- .../csrc/jit/serialization/import_source.cpp | 33 +- torch/csrc/jit/serialization/pickler.cpp | 15 +- torch/csrc/jit/serialization/python_print.cpp | 11 +- torch/csrc/jit/serialization/unpickler.cpp | 14 + .../jit/tensorexpr/external_functions.cpp | 31 + torch/csrc/jit/tensorexpr/kernel.cpp | 51 +- torch/csrc/jit/tensorexpr/kernel.h | 6 + torch/csrc/jit/tensorexpr/lowerings.cpp | 23 + .../csrc/jit/tensorexpr/operators/conv2d.cpp | 62 + torch/csrc/jit/tensorexpr/operators/conv2d.h | 14 +- torch/csrc/lazy/core/dynamic_ir.h | 5 +- torch/csrc/lazy/core/lazy_graph_executor.cpp | 9 +- torch/csrc/lazy/core/lazy_graph_executor.h | 4 +- torch/csrc/lazy/core/shape_inference.cpp | 14 +- torch/csrc/lazy/core/shape_inference.h | 4 +- torch/csrc/lazy/core/tensor_impl.cpp | 34 +- torch/csrc/lazy/core/tensor_impl.h | 4 +- torch/csrc/lazy/python/python_util.cpp | 4 +- torch/csrc/lazy/ts_backend/dynamic_ir.cpp | 14 +- 
torch/csrc/lazy/ts_backend/dynamic_ir.h | 8 +- .../lazy/ts_backend/ts_native_functions.cpp | 6 - torch/csrc/onnx/init.cpp | 17 +- torch/csrc/onnx/onnx.h | 2 + torch/csrc/profiler/api.cpp | 22 +- torch/csrc/profiler/api.h | 18 +- torch/csrc/profiler/collection.cpp | 418 +- torch/csrc/profiler/collection.h | 98 +- torch/csrc/profiler/containers.h | 19 + .../profiler/execution_graph_observer.cpp | 174 +- .../csrc/profiler/execution_graph_observer.h | 9 +- .../csrc/profiler/kineto_client_interface.cpp | 25 +- torch/csrc/profiler/kineto_shim.cpp | 2 +- torch/csrc/profiler/python/init.cpp | 218 + torch/csrc/profiler/python/init.h | 11 + torch/csrc/profiler/util.cpp | 21 +- torch/csrc/profiler/util.h | 53 + torch/csrc/tensor/python_tensor.cpp | 4 + torch/csrc/tensor/python_tensor.h | 6 +- torch/csrc/utils/out_types.cpp | 23 +- torch/csrc/utils/out_types.h | 4 +- torch/csrc/utils/python_arg_parser.cpp | 2 +- torch/csrc/utils/python_arg_parser.h | 63 +- torch/csrc/utils/python_dispatch.cpp | 67 + torch/csrc/utils/tensor_numpy.cpp | 7 +- torch/csrc/utils/torch_dispatch_mode.h | 8 +- torch/cuda/__init__.py | 1 + torch/cuda/_memory_viz.py | 188 + torch/cuda/jiterator.py | 2 +- torch/cuda/memory.py | 24 + .../distributed/_shard/checkpoint/__init__.py | 14 +- torch/distributed/_shard/checkpoint/api.py | 29 +- .../_shard/checkpoint/default_planner.py | 204 + .../distributed/_shard/checkpoint/metadata.py | 45 +- .../distributed/_shard/checkpoint/planner.py | 344 + .../_shard/checkpoint/planner_helpers.py | 199 + .../_shard/checkpoint/resharding.py | 117 +- .../_shard/checkpoint/state_dict_loader.py | 125 +- .../_shard/checkpoint/state_dict_saver.py | 15 +- torch/distributed/_shard/checkpoint/utils.py | 44 +- torch/distributed/_shard/partial_tensor.py | 1 + .../_shard/sharded_optim/__init__.py | 1 + .../_shard/sharded_tensor/__init__.py | 22 +- .../_shard/sharded_tensor/_ops/_common.py | 3 +- .../_shard/sharded_tensor/_ops/tensor_ops.py | 11 +- .../distributed/_shard/sharded_tensor/api.py | 4 +- torch/distributed/_shard/sharding_plan/api.py | 1 + .../chunk_sharding_spec_ops/_common.py | 270 +- .../chunk_sharding_spec_ops/embedding.py | 170 +- .../chunk_sharding_spec_ops/embedding_bag.py | 621 +- .../_checkpoint/checkpoint_wrapper.py | 40 +- .../algorithms/_comm_hooks/default_hooks.py | 85 +- .../algorithms/ddp_comm_hooks/__init__.py | 1 + .../ddp_comm_hooks/debugging_hooks.py | 1 + .../ddp_comm_hooks/default_hooks.py | 5 + .../ddp_comm_hooks/post_localSGD_hook.py | 1 + .../ddp_comm_hooks/powerSGD_hook.py | 2 + .../ddp_comm_hooks/quantization_hooks.py | 2 + torch/distributed/algorithms/join.py | 1 + .../algorithms/model_averaging/averagers.py | 55 +- .../hierarchical_model_averager.py | 71 +- torch/distributed/autograd/__init__.py | 1 + torch/distributed/distributed_c10d.py | 45 +- torch/distributed/elastic/timer/__init__.py | 1 + .../elastic/timer/file_based_local_timer.py | 313 + torch/distributed/fsdp/_optim_utils.py | 228 +- .../fsdp/{shard_utils.py => _shard_utils.py} | 113 +- torch/distributed/fsdp/_utils.py | 11 + .../fsdp/fully_sharded_data_parallel.py | 595 +- torch/distributed/fsdp/wrap.py | 2 +- torch/distributed/launch.py | 20 +- torch/distributed/nn/api/remote_module.py | 4 + torch/distributed/nn/functional.py | 1 + torch/distributed/optim/functional_rprop.py | 5 +- torch/distributed/optim/optimizer.py | 5 + .../optim/post_localSGD_optimizer.py | 63 +- torch/distributed/optim/utils.py | 1 + .../optim/zero_redundancy_optimizer.py | 1 + torch/distributed/pipeline/sync/pipe.py | 5 + 
torch/distributed/rpc/api.py | 35 +- torch/distributed/rpc/functions.py | 1 + torch/distributed/rpc/options.py | 1 + .../rpc/server_process_global_profiler.py | 3 +- torch/distributed/run.py | 9 +- torch/distributed/utils.py | 24 +- torch/distributions/bernoulli.py | 1 + torch/distributions/beta.py | 1 + torch/distributions/binomial.py | 1 + torch/distributions/categorical.py | 1 + torch/distributions/cauchy.py | 1 + torch/distributions/chi2.py | 1 + torch/distributions/continuous_bernoulli.py | 1 + torch/distributions/dirichlet.py | 1 + torch/distributions/exponential.py | 1 + torch/distributions/fishersnedecor.py | 1 + torch/distributions/gamma.py | 1 + torch/distributions/geometric.py | 1 + torch/distributions/gumbel.py | 1 + torch/distributions/half_cauchy.py | 1 + torch/distributions/half_normal.py | 1 + torch/distributions/independent.py | 8 +- torch/distributions/kumaraswamy.py | 1 + torch/distributions/laplace.py | 1 + torch/distributions/lkj_cholesky.py | 1 + torch/distributions/log_normal.py | 1 + torch/distributions/logistic_normal.py | 3 +- .../lowrank_multivariate_normal.py | 4 +- torch/distributions/mixture_same_family.py | 17 +- torch/distributions/multinomial.py | 1 + torch/distributions/multivariate_normal.py | 3 + torch/distributions/normal.py | 1 + torch/distributions/one_hot_categorical.py | 1 + torch/distributions/pareto.py | 1 + torch/distributions/poisson.py | 1 + torch/distributions/relaxed_bernoulli.py | 3 +- torch/distributions/relaxed_categorical.py | 3 +- torch/distributions/studentT.py | 1 + torch/distributions/uniform.py | 1 + torch/distributions/von_mises.py | 3 +- torch/distributions/weibull.py | 1 + torch/distributions/wishart.py | 3 +- torch/functional.py | 89 +- torch/futures/__init__.py | 2 + torch/fx/_symbolic_trace.py | 127 +- torch/fx/experimental/const_fold.py | 4 + torch/fx/experimental/meta_tracer.py | 4 +- .../constraint_generator.py | 214 +- torch/fx/experimental/proxy_tensor.py | 657 +- torch/fx/experimental/symbolic_shapes.py | 111 +- torch/fx/experimental/unification/core.py | 1 + torch/fx/experimental/unification/dispatch.py | 2 +- torch/fx/experimental/unification/match.py | 5 +- torch/fx/experimental/unification/more.py | 3 + .../unification/multipledispatch/core.py | 12 +- .../multipledispatch/dispatcher.py | 10 +- .../unification/multipledispatch/utils.py | 8 +- .../unification/multipledispatch/variadic.py | 1 + torch/fx/experimental/unification/utils.py | 1 + torch/fx/experimental/unification/variable.py | 5 +- torch/fx/graph.py | 57 +- torch/fx/graph_module.py | 12 +- torch/fx/interpreter.py | 20 +- torch/fx/operator_schemas.py | 2 +- torch/fx/passes/backends/nvfuser.py | 2 +- torch/fx/passes/infra/pass_manager.py | 16 +- torch/fx/passes/pass_manager.py | 45 +- torch/fx/passes/reinplace.py | 288 +- torch/fx/passes/splitter_base.py | 32 +- torch/fx/passes/tools_common.py | 67 +- torch/fx/passes/utils/__init__.py | 2 +- torch/fx/passes/utils/common.py | 18 +- torch/fx/passes/utils/matcher_utils.py | 233 + torch/fx/proxy.py | 21 +- torch/fx/subgraph_rewriter.py | 276 +- torch/fx/traceback.py | 62 + torch/hub.py | 16 +- torch/jit/_shape_functions.py | 100 +- torch/jit/quantized.py | 18 +- torch/library.py | 1 + torch/linalg/__init__.py | 8 +- torch/masked/__init__.py | 2 + torch/masked/maskedtensor/__init__.py | 8 + torch/masked/maskedtensor/binary.py | 189 + torch/masked/maskedtensor/core.py | 590 + torch/masked/maskedtensor/creation.py | 58 + torch/masked/maskedtensor/passthrough.py | 42 + torch/masked/maskedtensor/unary.py | 188 + 
torch/monitor/__init__.py | 1 + torch/nn/functional.py | 10 +- torch/nn/grad.py | 2 + torch/nn/init.py | 1 + torch/nn/intrinsic/qat/modules/conv_fused.py | 2 +- torch/nn/intrinsic/qat/modules/linear_relu.py | 3 +- .../quantized/dynamic/modules/linear_relu.py | 7 +- .../nn/intrinsic/quantized/modules/bn_relu.py | 10 +- .../intrinsic/quantized/modules/conv_relu.py | 14 +- .../quantized/modules/linear_relu.py | 7 +- torch/nn/modules/activation.py | 51 +- torch/nn/modules/batchnorm.py | 4 + torch/nn/modules/channelshuffle.py | 1 + torch/nn/modules/container.py | 4 +- torch/nn/modules/distance.py | 13 +- torch/nn/modules/fold.py | 3 + torch/nn/modules/lazy.py | 1 + torch/nn/modules/loss.py | 33 +- torch/nn/modules/module.py | 40 +- torch/nn/modules/padding.py | 9 + torch/nn/modules/pooling.py | 3 +- torch/nn/modules/rnn.py | 12 +- torch/nn/modules/sparse.py | 4 + torch/nn/modules/transformer.py | 102 +- torch/nn/modules/upsampling.py | 91 +- torch/nn/parallel/data_parallel.py | 1 + torch/nn/parallel/distributed.py | 26 +- torch/nn/qat/__init__.py | 6 + torch/nn/qat/dynamic/__init__.py | 6 + torch/nn/qat/dynamic/modules/linear.py | 35 +- torch/nn/qat/modules/__init__.py | 20 +- torch/nn/qat/modules/conv.py | 276 +- torch/nn/qat/modules/embedding_ops.py | 151 +- torch/nn/qat/modules/linear.py | 87 +- torch/nn/quantizable/modules/__init__.py | 6 +- torch/nn/quantizable/modules/activation.py | 464 +- torch/nn/quantizable/modules/rnn.py | 396 +- torch/nn/quantized/__init__.py | 1 + .../quantized/_reference/modules/__init__.py | 19 +- torch/nn/quantized/_reference/modules/conv.py | 335 +- .../nn/quantized/_reference/modules/linear.py | 67 +- torch/nn/quantized/_reference/modules/rnn.py | 494 +- .../nn/quantized/_reference/modules/sparse.py | 105 +- .../nn/quantized/_reference/modules/utils.py | 175 +- torch/nn/quantized/dynamic/__init__.py | 2 +- .../nn/quantized/dynamic/modules/__init__.py | 19 +- torch/nn/quantized/dynamic/modules/conv.py | 403 +- torch/nn/quantized/dynamic/modules/linear.py | 136 +- torch/nn/quantized/dynamic/modules/rnn.py | 1065 +- torch/nn/quantized/functional.py | 619 +- torch/nn/quantized/modules/__init__.py | 125 +- torch/nn/quantized/modules/activation.py | 295 +- torch/nn/quantized/modules/batchnorm.py | 115 +- torch/nn/quantized/modules/conv.py | 928 +- torch/nn/quantized/modules/dropout.py | 35 +- torch/nn/quantized/modules/embedding_ops.py | 303 +- .../quantized/modules/functional_modules.py | 239 +- torch/nn/quantized/modules/linear.py | 304 +- torch/nn/quantized/modules/normalization.py | 216 +- torch/nn/quantized/modules/rnn.py | 54 +- torch/nn/quantized/modules/utils.py | 88 +- torch/nn/utils/_deprecation_utils.py | 45 + .../conv_expanded_weights.py | 23 +- .../nn/utils/_expanded_weights/conv_utils.py | 57 +- torch/nn/utils/_per_sample_grad.py | 1 + torch/nn/utils/init.py | 1 + torch/nn/utils/memory_format.py | 15 +- torch/nn/utils/parametrizations.py | 15 +- torch/nn/utils/parametrize.py | 1 + torch/nn/utils/prune.py | 40 +- torch/nn/utils/rnn.py | 11 +- torch/nn/utils/stateless.py | 1 + torch/onnx/__init__.py | 3 + torch/onnx/_constants.py | 4 +- torch/onnx/_exporter_states.py | 13 + torch/onnx/_globals.py | 20 +- .../_dbr => onnx/_internal}/__init__.py | 0 torch/onnx/_internal/_beartype.py | 105 + torch/onnx/_onnx_supported_ops.py | 2 +- torch/onnx/_patch_torch.py | 130 +- torch/onnx/_type_utils.py | 239 + torch/onnx/errors.py | 73 +- torch/onnx/symbolic_caffe2.py | 28 +- torch/onnx/symbolic_helper.py | 601 +- torch/onnx/symbolic_opset10.py | 69 +- 
torch/onnx/symbolic_opset11.py | 134 +- torch/onnx/symbolic_opset12.py | 85 +- torch/onnx/symbolic_opset13.py | 58 +- torch/onnx/symbolic_opset14.py | 1 + torch/onnx/symbolic_opset16.py | 7 +- torch/onnx/symbolic_opset17.py | 19 + torch/onnx/symbolic_opset8.py | 45 +- torch/onnx/symbolic_opset9.py | 1072 +- torch/onnx/utils.py | 158 +- torch/onnx/verification.py | 104 +- torch/optim/adam.py | 40 +- torch/optim/adamax.py | 11 + torch/optim/adamw.py | 15 +- torch/optim/lr_scheduler.py | 65 +- torch/optim/lr_scheduler.pyi | 5 + torch/optim/rmsprop.py | 58 +- torch/optim/rprop.py | 48 +- torch/optim/sgd.py | 2 + torch/optim/swa_utils.py | 8 +- torch/overrides.py | 7 +- torch/profiler/__init__.py | 35 +- torch/profiler/_pattern_matcher.py | 207 +- torch/profiler/profiler.py | 2 +- torch/quantization/fx/_equalize.py | 4 +- torch/quasirandom.py | 1 + torch/serialization.py | 48 +- torch/sparse/__init__.py | 1 + torch/testing/_creation.py | 1 + .../testing/_internal/autocast_test_lists.py | 19 + .../_internal/check_kernel_launches.py | 7 +- torch/testing/_internal/common_cuda.py | 13 +- torch/testing/_internal/common_device_type.py | 9 +- .../_internal/common_methods_invocations.py | 22952 ++++++---------- torch/testing/_internal/common_modules.py | 32 +- torch/testing/_internal/common_nn.py | 35 + .../testing/_internal/common_quantization.py | 10 +- torch/testing/_internal/common_utils.py | 245 +- .../testing/_internal/composite_compliance.py | 4 +- .../_internal/distributed/distributed_test.py | 89 +- .../_internal/distributed/rpc/rpc_test.py | 2 +- .../_internal/distributed/rpc_utils.py | 2 +- .../_internal/jit_metaprogramming_utils.py | 1 + torch/testing/_internal/opinfo/__init__.py | 2 + torch/testing/_internal/opinfo/core.py | 2657 ++ .../_internal/opinfo/definitions/__init__.py | 18 + .../_internal/opinfo/definitions/_masked.py | 1132 + .../_internal/opinfo/definitions/fft.py | 715 + .../_internal/opinfo/definitions/linalg.py | 2282 ++ .../_internal/opinfo/definitions/special.py | 684 + torch/testing/_internal/opinfo/refs.py | 214 + torch/testing/_internal/opinfo/utils.py | 260 + torch/testing/_internal/opinfo_helper.py | 139 - torch/testing/_internal/schema_check_mode.py | 12 +- torch/utils/_cuda_trace.py | 76 + torch/utils/_pytree.py | 64 +- torch/utils/_zip.py | 11 +- torch/utils/bottleneck/__main__.py | 2 +- torch/utils/checkpoint.py | 38 +- torch/utils/cpp_extension.py | 105 +- torch/utils/data/_utils/collate.py | 2 + torch/utils/data/_utils/pin_memory.py | 13 +- torch/utils/data/dataloader.py | 3 +- torch/utils/data/datapipes/_hook_iterator.py | 2 +- torch/utils/data/datapipes/_typing.py | 4 +- torch/utils/data/datapipes/datapipe.py | 2 + torch/utils/data/datapipes/gen_pyi.py | 6 +- torch/utils/data/datapipes/iter/callable.py | 6 +- .../data/datapipes/iter/combinatorics.py | 17 +- torch/utils/data/datapipes/iter/combining.py | 16 +- torch/utils/data/datapipes/iter/filelister.py | 1 + torch/utils/data/datapipes/iter/fileopener.py | 1 + torch/utils/data/datapipes/iter/grouping.py | 3 + torch/utils/data/datapipes/iter/selecting.py | 3 + .../utils/data/datapipes/iter/streamreader.py | 1 + torch/utils/data/datapipes/iter/utils.py | 1 + torch/utils/data/datapipes/map/__init__.py | 2 +- torch/utils/data/datapipes/map/callable.py | 1 + .../utils/data/datapipes/map/combinatorics.py | 100 +- torch/utils/data/datapipes/map/combining.py | 2 + torch/utils/data/datapipes/map/grouping.py | 1 + torch/utils/data/datapipes/map/utils.py | 1 + torch/utils/data/datapipes/utils/common.py | 82 +- 
torch/utils/data/dataset.py | 15 +- torch/utils/data/distributed.py | 1 + torch/utils/data/graph.py | 20 +- torch/utils/data/sampler.py | 1 + torch/utils/dlpack.py | 3 +- torch/utils/hipify/cuda_to_hip_mappings.py | 66 +- torch/utils/hipify/hipify_python.py | 3 +- torch/utils/hooks.py | 32 +- torch/utils/tensorboard/_pytorch_graph.py | 4 +- torch/utils/tensorboard/summary.py | 11 +- torch/utils/throughput_benchmark.py | 17 +- torchgen/api/autograd.py | 104 +- torchgen/api/python.py | 99 +- torchgen/api/types.py | 29 +- torchgen/context.py | 1 + torchgen/dest/lazy_ir.py | 9 +- torchgen/gen.py | 314 +- torchgen/gen_backend_stubs.py | 42 +- torchgen/gen_functionalization_type.py | 2 + torchgen/model.py | 185 +- torchgen/native_function_generation.py | 164 +- torchgen/selective_build/selector.py | 2 +- .../gen_jit_shape_functions.py | 31 +- torchgen/utils.py | 37 +- 1604 files changed, 95410 insertions(+), 74365 deletions(-) create mode 100644 .circleci/cimodel/data/simple/upload_test_stats_definition.py create mode 100755 .circleci/docker/common/install_ucc.sh delete mode 100644 .github/generated-ciflow-ruleset.json delete mode 100644 .github/merge_rules.json create mode 100644 .github/merge_rules.yaml create mode 100644 .github/requirements-gha-cache.txt create mode 100644 .github/scripts/comment_on_pr.py delete mode 100755 .github/scripts/lint_test_ownership.py create mode 100644 .github/scripts/trymerge_explainer.py rename .github/workflows/{_mac-test-arm64.yml => _mac-test-mps.yml} (98%) delete mode 100644 .github/workflows/cancel_redundant_workflows.yml create mode 100644 .github/workflows/docker-release.yml create mode 100644 .github/workflows/mac-mps.yml create mode 100644 .github/workflows/pull.yml create mode 100644 .github/workflows/push_nightly_docker_ghcr.yml delete mode 100644 .github/workflows/stale_pull_requests.yml rename .jenkins/pytorch/win-test-helpers/installation-helpers/{install_miniconda3.bat => activate_miniconda3.bat} (65%) delete mode 100644 aten/src/ATen/core/TorchDispatchModeTLS.cpp create mode 100644 aten/src/ATen/core/TorchDispatchUtils.cpp rename aten/src/ATen/core/{TorchDispatchModeTLS.h => TorchDispatchUtils.h} (55%) create mode 100644 aten/src/ATen/mps/IndexKernels.h create mode 100644 aten/src/ATen/native/cpu/CopyKernel.h create mode 100644 aten/src/ATen/native/cuda/Copy.h create mode 100644 aten/src/ATen/native/cuda/CumminmaxKernel.cu create mode 100644 aten/src/ATen/native/cuda/CumprodKernel.cu create mode 100644 aten/src/ATen/native/cuda/CumsumKernel.cu create mode 100644 aten/src/ATen/native/cuda/LogcumsumexpKernel.cu rename aten/src/ATen/native/cuda/{ScanKernels.cu => ScanUtils.cuh} (84%) create mode 100644 aten/src/ATen/native/mkldnn/Common.h create mode 100644 aten/src/ATen/native/mkldnn/ConvPrepack.cpp create mode 100644 aten/src/ATen/native/mkldnn/ConvPrepack.h create mode 100644 aten/src/ATen/native/mkldnn/OpContext.cpp create mode 100644 aten/src/ATen/native/mkldnn/OpContext.h create mode 100644 aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp create mode 100644 aten/src/ATen/native/mps/operations/BitwiseOps.mm create mode 100644 aten/src/ATen/native/mps/operations/Indexing.h create mode 100644 aten/src/ATen/native/mps/operations/Pad.mm delete mode 100644 aten/src/ATen/native/vulkan/api/vk_mem_alloc.h delete mode 100644 aten/src/ATen/native/vulkan/ops/QuantizedConvolution.cpp delete mode 100644 aten/src/ATen/native/vulkan/ops/QuantizedConvolution.h delete mode 100644 aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.cpp delete 
mode 100644 aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.h delete mode 100644 aten/src/ATen/native/vulkan/ops/VulkanOpContext.cpp delete mode 100644 aten/src/ATen/native/vulkan/ops/VulkanOpContext.h create mode 100644 aten/src/ATen/native/vulkan/ops/VulkanPackedContext.h create mode 100644 aten/src/ATen/templates/RegisterDispatchDefinitions.ini create mode 100644 benchmarks/cpp/nvfuser/matmul.cpp create mode 100644 c10/core/impl/GPUTrace.cpp create mode 100644 c10/core/impl/GPUTrace.h create mode 100644 c10/core/impl/TorchDispatchModeTLS.cpp create mode 100644 c10/core/impl/TorchDispatchModeTLS.h create mode 100644 docs/source/masked.rst delete mode 100644 functorch/codegen/gen_functorch_lagging_op_db.py create mode 100644 functorch/functorch/csrc/dim/arena.h create mode 100644 functorch/functorch/csrc/dim/dim.cpp create mode 100644 functorch/functorch/csrc/dim/dim.h create mode 100644 functorch/functorch/csrc/dim/minpybind.h create mode 100644 functorch/functorch/csrc/dim/python_variable_simple.h create mode 100644 functorch/functorch/dim/README.md create mode 100644 functorch/functorch/dim/__init__.py create mode 100644 functorch/functorch/dim/batch_tensor.py create mode 100644 functorch/functorch/dim/delayed_mul_tensor.py create mode 100644 functorch/functorch/dim/dim.py create mode 100644 functorch/functorch/dim/magic_trace.py create mode 100644 functorch/functorch/dim/op_properties.py create mode 100644 functorch/functorch/dim/reference.py create mode 100644 functorch/functorch/dim/tree_map.py create mode 100644 functorch/functorch/dim/wrap_type.py create mode 100644 functorch/functorch/experimental/cond.py create mode 100644 functorch/functorch/experimental/ops.py create mode 100644 functorch/test/attn_ft.py create mode 100644 functorch/test/attn_positional.py delete mode 100644 functorch/test/functorch_lagging_op_db.py create mode 100644 functorch/test/test_control_flow.py create mode 100644 functorch/test/test_dims.py create mode 100644 test/cpp/c10d/ProcessGroupUCCTest.cpp create mode 100644 test/distributed/_shard/checkpoint/test_planner.py create mode 100644 test/distributed/elastic/timer/file_based_local_timer_test.py delete mode 100644 test/jit/test_legacy_upgraders.py create mode 100644 test/nn/test_packed_sequence.py create mode 100644 test/nn/test_pooling.py create mode 100644 test/onnx/internal/test_beartype.py create mode 100644 test/onnx/test_autograd_funs.py create mode 100644 test/quantization/core/experimental/apot_fx_graph_mode_ptq.py create mode 100644 test/quantization/core/experimental/apot_fx_graph_mode_qat.py rename test/quantization/core/experimental/{fx_graph_mode_apot.py => quantization_util.py} (52%) delete mode 100644 test/quantization/dbr/test_quantize_dbr.py create mode 100755 test/run_doctests.sh mode change 100644 => 100755 test/run_test.py create mode 100644 test/test_cuda_trace.py create mode 100644 test/test_dlpack.py delete mode 100644 test/test_dynamo_cudagraphs.py create mode 100644 test/test_maskedtensor.py create mode 100644 test/test_mkldnn_fusion.py rename test/{jit => }/test_schema_check.py (94%) create mode 160000 third_party/VulkanMemoryAllocator delete mode 100644 third_party/cpuinfo.BUILD create mode 100644 torch/_C/_profiler.pyi rename {test/quantization/dbr => torch/_dispatch}/__init__.py (100%) create mode 100644 torch/_dispatch/_dispatcher.py create mode 100644 torch/_prims/nvfuser_prims.py create mode 100644 torch/ao/nn/qat/__init__.py create mode 100644 torch/ao/nn/qat/dynamic/__init__.py create mode 100644 
torch/ao/nn/qat/dynamic/modules/__init__.py create mode 100644 torch/ao/nn/qat/dynamic/modules/linear.py create mode 100644 torch/ao/nn/qat/modules/__init__.py create mode 100644 torch/ao/nn/qat/modules/conv.py create mode 100644 torch/ao/nn/qat/modules/embedding_ops.py create mode 100644 torch/ao/nn/qat/modules/linear.py create mode 100644 torch/ao/nn/quantizable/__init__.py create mode 100644 torch/ao/nn/quantizable/modules/__init__.py create mode 100644 torch/ao/nn/quantizable/modules/activation.py create mode 100644 torch/ao/nn/quantizable/modules/rnn.py create mode 100644 torch/ao/nn/quantized/__init__.py create mode 100644 torch/ao/nn/quantized/_reference/__init__.py create mode 100644 torch/ao/nn/quantized/_reference/modules/__init__.py create mode 100644 torch/ao/nn/quantized/_reference/modules/conv.py create mode 100644 torch/ao/nn/quantized/_reference/modules/linear.py create mode 100644 torch/ao/nn/quantized/_reference/modules/rnn.py create mode 100644 torch/ao/nn/quantized/_reference/modules/sparse.py create mode 100644 torch/ao/nn/quantized/_reference/modules/utils.py create mode 100644 torch/ao/nn/quantized/dynamic/__init__.py create mode 100644 torch/ao/nn/quantized/dynamic/modules/__init__.py create mode 100644 torch/ao/nn/quantized/dynamic/modules/conv.py create mode 100644 torch/ao/nn/quantized/dynamic/modules/linear.py create mode 100644 torch/ao/nn/quantized/dynamic/modules/rnn.py create mode 100644 torch/ao/nn/quantized/functional.py create mode 100644 torch/ao/nn/quantized/modules/__init__.py create mode 100644 torch/ao/nn/quantized/modules/activation.py create mode 100644 torch/ao/nn/quantized/modules/batchnorm.py create mode 100644 torch/ao/nn/quantized/modules/conv.py create mode 100644 torch/ao/nn/quantized/modules/dropout.py create mode 100644 torch/ao/nn/quantized/modules/embedding_ops.py create mode 100644 torch/ao/nn/quantized/modules/functional_modules.py create mode 100644 torch/ao/nn/quantized/modules/linear.py create mode 100644 torch/ao/nn/quantized/modules/normalization.py create mode 100644 torch/ao/nn/quantized/modules/rnn.py create mode 100644 torch/ao/nn/quantized/modules/utils.py delete mode 100644 torch/ao/ns/_numeric_suite_dbr.py delete mode 100644 torch/ao/quantization/_dbr/README.md delete mode 100644 torch/ao/quantization/_dbr/auto_trace.py delete mode 100644 torch/ao/quantization/_dbr/auto_trace_rewriter.py delete mode 100644 torch/ao/quantization/_dbr/function_fusion.py delete mode 100644 torch/ao/quantization/_dbr/fusion.py delete mode 100644 torch/ao/quantization/_dbr/mappings.py delete mode 100644 torch/ao/quantization/_dbr/model_utils.py delete mode 100644 torch/ao/quantization/_dbr/module_swap_utils.py delete mode 100644 torch/ao/quantization/_dbr/qconfig_mapping_utils.py delete mode 100644 torch/ao/quantization/_dbr/quantization_state.py delete mode 100644 torch/ao/quantization/_dbr/torchscript_utils.py delete mode 100644 torch/ao/quantization/_dbr/utils.py delete mode 100644 torch/ao/quantization/_quantize_dbr.py create mode 100644 torch/ao/quantization/backend_config/fbgemm.py create mode 100644 torch/ao/quantization/backend_config/qnnpack.py delete mode 100644 torch/ao/quantization/fx/common_quantization_patterns.py rename torch/ao/quantization/fx/{qconfig_utils.py => qconfig_mapping_utils.py} (95%) create mode 100644 torch/ao/sparsity/_experimental/data_sparsifier/quantization_utils.py create mode 100644 torch/csrc/autograd/graph_task.h delete mode 100644 torch/csrc/jit/backends/coreml/observer/PTMCoreMLObserver.h delete mode 100644 
torch/csrc/jit/backends/coreml/observer/PTMCoreMLObserver.mm delete mode 100644 torch/csrc/jit/codegen/cuda/inline_propagator.cpp delete mode 100644 torch/csrc/jit/codegen/cuda/inline_propagator.h create mode 100644 torch/csrc/jit/codegen/cuda/inlining.cpp create mode 100644 torch/csrc/jit/codegen/cuda/inlining.h create mode 100644 torch/csrc/jit/codegen/cuda/lower_divisible_split.cpp create mode 100644 torch/csrc/jit/codegen/cuda/lower_divisible_split.h delete mode 100644 torch/csrc/jit/codegen/cuda/reference_tensor.h create mode 100644 torch/csrc/jit/codegen/cuda/runtime/fused_welford_helper.cu create mode 100644 torch/csrc/jit/codegen/cuda/runtime/fused_welford_impl.cu create mode 100644 torch/csrc/jit/codegen/cuda/scheduler/transpose.cpp create mode 100644 torch/csrc/jit/codegen/cuda/scheduler/transpose.h create mode 100644 torch/csrc/jit/codegen/cuda/scheduler/transpose_heuristic.h create mode 100644 torch/csrc/jit/codegen/cuda/scheduler/vectorize_helper.cpp create mode 100644 torch/csrc/jit/codegen/cuda/test/test_gpu_tensor_factories.cpp create mode 100644 torch/csrc/jit/codegen/cuda/test/test_gpu_transpose.cpp create mode 100644 torch/csrc/jit/codegen/cuda/test/test_gpu_utils.cpp delete mode 100644 torch/csrc/jit/operator_upgraders/upgraders_guard.cpp delete mode 100644 torch/csrc/jit/operator_upgraders/upgraders_guard.h create mode 100644 torch/csrc/jit/passes/mkldnn_rewrite.cpp create mode 100644 torch/csrc/jit/passes/mkldnn_rewrite.h create mode 100644 torch/csrc/jit/passes/onnx/naming.cpp create mode 100644 torch/csrc/jit/passes/onnx/naming.h create mode 100644 torch/csrc/jit/passes/onnx/pattern_conversion/autograd_function_process.cpp create mode 100644 torch/csrc/jit/passes/onnx/pattern_conversion/autograd_function_process.h create mode 100644 torch/csrc/profiler/python/init.cpp create mode 100644 torch/csrc/profiler/python/init.h create mode 100644 torch/cuda/_memory_viz.py create mode 100644 torch/distributed/_shard/checkpoint/default_planner.py create mode 100644 torch/distributed/_shard/checkpoint/planner.py create mode 100644 torch/distributed/_shard/checkpoint/planner_helpers.py create mode 100644 torch/distributed/elastic/timer/file_based_local_timer.py rename torch/distributed/fsdp/{shard_utils.py => _shard_utils.py} (64%) create mode 100644 torch/fx/passes/utils/matcher_utils.py create mode 100644 torch/fx/traceback.py create mode 100644 torch/masked/__init__.py create mode 100644 torch/masked/maskedtensor/__init__.py create mode 100644 torch/masked/maskedtensor/binary.py create mode 100644 torch/masked/maskedtensor/core.py create mode 100644 torch/masked/maskedtensor/creation.py create mode 100644 torch/masked/maskedtensor/passthrough.py create mode 100644 torch/masked/maskedtensor/unary.py create mode 100644 torch/nn/utils/_deprecation_utils.py rename torch/{ao/quantization/_dbr => onnx/_internal}/__init__.py (100%) create mode 100644 torch/onnx/_internal/_beartype.py create mode 100644 torch/onnx/_type_utils.py create mode 100644 torch/onnx/symbolic_opset17.py create mode 100644 torch/testing/_internal/opinfo/__init__.py create mode 100644 torch/testing/_internal/opinfo/core.py create mode 100644 torch/testing/_internal/opinfo/definitions/__init__.py create mode 100644 torch/testing/_internal/opinfo/definitions/_masked.py create mode 100644 torch/testing/_internal/opinfo/definitions/fft.py create mode 100644 torch/testing/_internal/opinfo/definitions/linalg.py create mode 100644 torch/testing/_internal/opinfo/definitions/special.py create mode 100644 
torch/testing/_internal/opinfo/refs.py create mode 100644 torch/testing/_internal/opinfo/utils.py delete mode 100644 torch/testing/_internal/opinfo_helper.py create mode 100644 torch/utils/_cuda_trace.py diff --git a/.circleci/cimodel/data/dimensions.py b/.circleci/cimodel/data/dimensions.py index 7f9ebccbcc898..5841b3806b135 100644 --- a/.circleci/cimodel/data/dimensions.py +++ b/.circleci/cimodel/data/dimensions.py @@ -4,6 +4,7 @@ "102", "113", "116", + "117", ] ROCM_VERSIONS = [ diff --git a/.circleci/cimodel/data/simple/ios_definitions.py b/.circleci/cimodel/data/simple/ios_definitions.py index a01a2db8229fc..5dfb84d6e5da1 100644 --- a/.circleci/cimodel/data/simple/ios_definitions.py +++ b/.circleci/cimodel/data/simple/ios_definitions.py @@ -11,7 +11,7 @@ def __init__(self, name, custom_build_name=""): def render(self): extra_parts = [self.custom_build_name] if len(self.custom_build_name) > 0 else [] - return "_".join([self.name] + extra_parts) + return "-".join([self.name] + extra_parts).replace("_", "-") def get_platform(arch_variant_name): @@ -25,30 +25,25 @@ def __init__(self, xcode_version, arch_variant, is_org_member_context=True, extr self.is_org_member_context = is_org_member_context self.extra_props = extra_props - def gen_name_parts(self, with_version_dots): - - version_parts = self.xcode_version.render_dots_or_parts(with_version_dots) - build_variant_suffix = "_".join([self.arch_variant.render(), "build"]) - + def gen_name_parts(self): + version_parts = self.xcode_version.render_dots_or_parts("-") + build_variant_suffix = self.arch_variant.render() return [ - "pytorch", "ios", ] + version_parts + [ build_variant_suffix, ] def gen_job_name(self): - return "_".join(self.gen_name_parts(False)) + return "-".join(self.gen_name_parts()) def gen_tree(self): - platform_name = get_platform(self.arch_variant.name) - props_dict = { - "build_environment": "-".join(self.gen_name_parts(True)), + "name": self.gen_job_name(), + "build_environment": self.gen_job_name(), "ios_arch": self.arch_variant.name, "ios_platform": platform_name, - "name": self.gen_job_name(), } if self.is_org_member_context: @@ -63,16 +58,12 @@ def gen_tree(self): WORKFLOW_DATA = [ IOSJob(XCODE_VERSION, ArchVariant("x86_64"), is_org_member_context=False, extra_props={ "lite_interpreter": miniutils.quote(str(int(True)))}), - IOSJob(XCODE_VERSION, ArchVariant("x86_64", "full_jit"), is_org_member_context=False, extra_props={ - "lite_interpreter": miniutils.quote(str(int(False)))}), IOSJob(XCODE_VERSION, ArchVariant("arm64"), extra_props={ "lite_interpreter": miniutils.quote(str(int(True)))}), IOSJob(XCODE_VERSION, ArchVariant("arm64", "metal"), extra_props={ "use_metal": miniutils.quote(str(int(True))), "lite_interpreter": miniutils.quote(str(int(True)))}), - IOSJob(XCODE_VERSION, ArchVariant("arm64", "full_jit"), extra_props={ - "lite_interpreter": miniutils.quote(str(int(False)))}), - IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom"), extra_props={ + IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom-ops"), extra_props={ "op_list": "mobilenetv2.yaml", "lite_interpreter": miniutils.quote(str(int(True)))}), IOSJob(XCODE_VERSION, ArchVariant("x86_64", "coreml"), is_org_member_context=False, extra_props={ diff --git a/.circleci/cimodel/data/simple/macos_definitions.py b/.circleci/cimodel/data/simple/macos_definitions.py index 371c8b694cf3b..d0b4df0f906cb 100644 --- a/.circleci/cimodel/data/simple/macos_definitions.py +++ b/.circleci/cimodel/data/simple/macos_definitions.py @@ -1,3 +1,7 @@ +from collections import 
OrderedDict +from cimodel.lib.miniutils import quote + + class MacOsJob: def __init__(self, os_version, is_build=False, is_test=False, extra_props=tuple()): # extra_props is tuple type, because mutable data structures for argument defaults @@ -11,10 +15,14 @@ def gen_tree(self): non_phase_parts = ["pytorch", "macos", self.os_version, "py3"] extra_name_list = [name for name, exist in self.extra_props.items() if exist] - full_job_name_list = non_phase_parts + extra_name_list + [ - 'build' if self.is_build else None, - 'test' if self.is_test else None, - ] + full_job_name_list = ( + non_phase_parts + + extra_name_list + + [ + "build" if self.is_build else None, + "test" if self.is_test else None, + ] + ) full_job_name = "_".join(list(filter(None, full_job_name_list))) @@ -41,12 +49,93 @@ def gen_tree(self): "10_13", is_build=True, is_test=True, - extra_props=tuple({ - "lite_interpreter": True - }.items()), - ) + extra_props=tuple({"lite_interpreter": True}.items()), + ), ] +def get_new_workflow_jobs(): + return [ + OrderedDict( + { + "mac_build": OrderedDict( + { + "name": "macos-12-py3-x86-64-build", + "build-environment": "macos-12-py3-x86-64", + "xcode-version": quote("13.3.1"), + } + ) + } + ), + OrderedDict( + { + "mac_test": OrderedDict( + { + "name": "macos-12-py3-x86-64-test-1-2-default", + "build-environment": "macos-12-py3-x86-64", + "xcode-version": quote("13.3.1"), + "shard-number": quote("1"), + "num-test-shards": quote("2"), + "requires": ["macos-12-py3-x86-64-build"], + } + ) + } + ), + OrderedDict( + { + "mac_test": OrderedDict( + { + "name": "macos-12-py3-x86-64-test-2-2-default", + "build-environment": "macos-12-py3-x86-64", + "xcode-version": quote("13.3.1"), + "shard-number": quote("2"), + "num-test-shards": quote("2"), + "requires": ["macos-12-py3-x86-64-build"], + } + ) + } + ), + OrderedDict( + { + "mac_test": OrderedDict( + { + "name": "macos-12-py3-x86-64-test-1-1-functorch", + "build-environment": "macos-12-py3-x86-64", + "xcode-version": quote("13.3.1"), + "shard-number": quote("1"), + "num-test-shards": quote("1"), + "test-config": "functorch", + "requires": ["macos-12-py3-x86-64-build"], + } + ) + } + ), + OrderedDict( + { + "mac_build": OrderedDict( + { + "name": "macos-12-py3-x86-64-lite-interpreter-build-test", + "build-environment": "macos-12-py3-lite-interpreter-x86-64", + "xcode-version": quote("13.3.1"), + "build-generates-artifacts": "false", + } + ) + } + ), + OrderedDict( + { + "mac_build": OrderedDict( + { + "name": "macos-12-py3-arm64-build", + "build-environment": "macos-12-py3-arm64", + "xcode-version": quote("13.3.1"), + "python-version": quote("3.9.12"), + } + ) + } + ), + ] + + def get_workflow_jobs(): return [item.gen_tree() for item in WORKFLOW_DATA] diff --git a/.circleci/cimodel/data/simple/nightly_ios.py b/.circleci/cimodel/data/simple/nightly_ios.py index 941a61a73b91e..f75bcb4bfe218 100644 --- a/.circleci/cimodel/data/simple/nightly_ios.py +++ b/.circleci/cimodel/data/simple/nightly_ios.py @@ -15,7 +15,7 @@ def __init__(self, def get_phase_name(self): return "upload" if self.is_upload else "build" - def get_common_name_pieces(self, with_version_dots): + def get_common_name_pieces(self, sep): extra_name_suffix = [self.get_phase_name()] if self.is_upload else [] @@ -24,7 +24,7 @@ def get_common_name_pieces(self, with_version_dots): common_name_pieces = [ "ios", ] + extra_name + [ - ] + ios_definitions.XCODE_VERSION.render_dots_or_parts(with_version_dots) + [ + ] + ios_definitions.XCODE_VERSION.render_dots_or_parts(sep) + [ "nightly", 
self.variant, "build", @@ -33,14 +33,14 @@ def get_common_name_pieces(self, with_version_dots): return common_name_pieces def gen_job_name(self): - return "_".join(["pytorch"] + self.get_common_name_pieces(False)) + return "_".join(["pytorch"] + self.get_common_name_pieces(None)) def gen_tree(self): build_configs = BUILD_CONFIGS_FULL_JIT if self.is_full_jit else BUILD_CONFIGS extra_requires = [x.gen_job_name() for x in build_configs] if self.is_upload else [] props_dict = { - "build_environment": "-".join(["libtorch"] + self.get_common_name_pieces(True)), + "build_environment": "-".join(["libtorch"] + self.get_common_name_pieces(".")), "requires": extra_requires, "context": "org-member", "filters": {"branches": {"only": "nightly"}}, diff --git a/.circleci/cimodel/data/simple/upload_test_stats_definition.py b/.circleci/cimodel/data/simple/upload_test_stats_definition.py new file mode 100644 index 0000000000000..0d51add5551ce --- /dev/null +++ b/.circleci/cimodel/data/simple/upload_test_stats_definition.py @@ -0,0 +1,20 @@ +from typing import OrderedDict + + +def get_workflow_job(): + return [ + OrderedDict( + { + "upload_test_stats": OrderedDict( + { + "name": "upload test status", + "requires": [ + "macos-12-py3-x86-64-test-1-2-default", + "macos-12-py3-x86-64-test-2-2-default", + "macos-12-py3-x86-64-test-1-1-functorch", + ], + } + ) + } + ), + ] diff --git a/.circleci/cimodel/data/simple/util/versions.py b/.circleci/cimodel/data/simple/util/versions.py index 53d3a837248c1..518feb2e38691 100644 --- a/.circleci/cimodel/data/simple/util/versions.py +++ b/.circleci/cimodel/data/simple/util/versions.py @@ -1,3 +1,6 @@ +from typing import Optional + + class MultiPartVersion: def __init__(self, parts, prefix=""): self.parts = parts @@ -13,14 +16,11 @@ def prefixed_parts(self): else: return [self.prefix] - def render_dots(self): - return ".".join(self.prefixed_parts()) - - def render_dots_or_parts(self, with_dots): - if with_dots: - return [self.render_dots()] - else: + def render_dots_or_parts(self, sep: Optional[str] = None): + if sep is None: return self.prefixed_parts() + else: + return [sep.join(self.prefixed_parts())] class CudaVersion(MultiPartVersion): diff --git a/.circleci/config.yml b/.circleci/config.yml index 4ca08b1b7c181..0b742215880ad 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -570,6 +570,196 @@ jobs: paths: - miniconda3 + mac_build: + parameters: + build-environment: + type: string + description: Top-level label for what's being built/tested. + xcode-version: + type: string + default: "13.3.1" + description: What xcode version to build with. 
+ build-generates-artifacts: + type: boolean + default: true + description: if the build generates build artifacts + python-version: + type: string + default: "3.8" + macos: + xcode: << parameters.xcode-version >> + resource_class: medium + environment: + BUILD_ENVIRONMENT: << parameters.build-environment >> + AWS_REGION: us-east-1 + steps: + + - checkout + - run_brew_for_macos_build + + - run: + name: Install sccache + command: | + sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${BASH_ENV}" + echo "export SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${BASH_ENV}" + + set +x + echo "export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}" + echo "export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}" + set -x + + - run: + name: Get workflow job id + command: | + echo "export OUR_GITHUB_JOB_ID=${CIRCLE_WORKFLOW_JOB_ID}" >> "${BASH_ENV}" + + - run: + name: Build + command: | + set -x + + git submodule sync + git submodule update --init --recursive --depth 1 --jobs 0 + + export PATH="/usr/local/bin:$PATH" + export WORKSPACE_DIR="${HOME}/workspace" + mkdir -p "${WORKSPACE_DIR}" + MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-MacOSX-x86_64.sh" + if [ << parameters.python-version >> == 3.9.12 ]; then + MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh" + fi + + # If a local installation of conda doesn't exist, we download and install conda + if [ ! -d "${WORKSPACE_DIR}/miniconda3" ]; then + mkdir -p "${WORKSPACE_DIR}" + curl --retry 3 ${MINICONDA_URL} -o "${WORKSPACE_DIR}"/miniconda3.sh + bash "${WORKSPACE_DIR}"/miniconda3.sh -b -p "${WORKSPACE_DIR}"/miniconda3 + fi + export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH" + # shellcheck disable=SC1091 + source "${WORKSPACE_DIR}"/miniconda3/bin/activate + + echo "export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${BASH_ENV}" + .jenkins/pytorch/macos-build.sh + + - when: + condition: << parameters.build-generates-artifacts >> + steps: + - run: + name: Archive artifacts into zip + command: | + zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json + cp artifacts.zip /Users/distiller/workspace + + - persist_to_workspace: + root: /Users/distiller/workspace/ + paths: + - miniconda3 + - artifacts.zip + + - store_artifacts: + path: /Users/distiller/project/artifacts.zip + + mac_test: + parameters: + build-environment: + type: string + shard-number: + type: string + num-test-shards: + type: string + xcode-version: + type: string + test-config: + type: string + default: 'default' + + macos: + xcode: << parameters.xcode-version >> + environment: + GIT_DEFAULT_BRANCH: 'master' + BUILD_ENVIRONMENT: << parameters.build-environment >> + TEST_CONFIG: << parameters.test-config >> + SHARD_NUMBER: << parameters.shard-number >> + NUM_TEST_SHARDS: << parameters.num-test-shards >> + PYTORCH_RETRY_TEST_CASES: 1 + PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1 + steps: + - checkout + - attach_workspace: + at: ~/workspace + - run_brew_for_macos_build + - run: + name: Test + no_output_timeout: "1h" + command: | + set -x + + git submodule sync --recursive + git submodule update --init --recursive + + mv ~/workspace/artifacts.zip . 
+ unzip artifacts.zip + + export IN_CI=1 + + COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}") + + export PATH="/usr/local/bin:$PATH" + export WORKSPACE_DIR="${HOME}/workspace" + mkdir -p "${WORKSPACE_DIR}" + + export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH" + source "${WORKSPACE_DIR}"/miniconda3/bin/activate + + # sanitize the input commit message and PR body here: + + # trim all new lines from commit messages to avoid issues with batch environment + # variable copying. see https://github.com/pytorch/pytorch/pull/80043#issuecomment-1167796028 + COMMIT_MESSAGES="${COMMIT_MESSAGES//[$'\n\r']}" + + # then trim all special characters like single and double quotes to avoid unescaped inputs to + # wreak havoc internally + export COMMIT_MESSAGES="${COMMIT_MESSAGES//[\'\"]}" + + python3 -mpip install dist/*.whl + .jenkins/pytorch/macos-test.sh + - run: + name: Copy files for uploading test stats + command: | + # copy into a parent folder test-reports because we can't use CIRCLEI_BUILD_NUM in path when persisting to workspace + mkdir -p test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports + cp -r test/test-reports test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports + - store_test_results: + path: test/test-reports + - persist_to_workspace: + root: /Users/distiller/project/ + paths: + - test-reports + + upload_test_stats: + machine: # executor type + image: ubuntu-2004:202010-01 # # recommended linux image - includes Ubuntu 20.04, docker 19.03.13, docker-compose 1.27.4 + steps: + - checkout + - attach_workspace: + at: ~/workspace + - run: + name: upload + command: | + set -ex + if [ -z ${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} ]; then + echo "No credentials found, cannot upload test stats (are you on a fork?)" + exit 0 + fi + cp -r ~/workspace/test-reports/* ~/project + pip3 install requests==2.26 rockset==0.8.3 boto3==1.19.12 six==1.16.0 + export AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} + export AWS_SECRET_ACCESS_KEY=${AWS_SECRET_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} + # i dont know how to get the run attempt number for reruns so default to 1 + python3 -m tools.stats.upload_test_stats --workflow-run-id "${CIRCLE_WORKFLOW_JOB_ID}" --workflow-run-attempt 1 --head-branch << pipeline.git.branch >> --circleci pytorch_macos_10_13_py3_test: environment: BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test @@ -911,12 +1101,13 @@ jobs: cd ${PROJ_ROOT}/ios/TestApp/benchmark mkdir -p ../models if [ ${USE_COREML_DELEGATE} == 1 ]; then - pip install coremltools==5.0b5 - pip install six + pip install coremltools==5.0b5 protobuf==3.20.1 six==1.16.0 python coreml_backend.py else - python trace_model.py + cd "${PROJ_ROOT}" + python test/mobile/model_test/gen_test_model.py ios-test fi + cd "${PROJ_ROOT}/ios/TestApp/benchmark" if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then echo "Setting up the TestApp for LiteInterpreter" ruby setup.rb --lite 1 @@ -924,10 +1115,10 @@ jobs: echo "Setting up the TestApp for Full JIT" ruby setup.rb fi - cd ${PROJ_ROOT}/ios/TestApp - instruments -s -devices - if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then - if [ ${USE_COREML_DELEGATE} == 1 ]; then + cd "${PROJ_ROOT}/ios/TestApp" + # instruments -s -devices + if [ "${BUILD_LITE_INTERPRETER}" == 1 ]; then + if [ "${USE_COREML_DELEGATE}" == 1 ]; then fastlane scan --only_testing TestAppTests/TestAppTests/testCoreML else fastlane scan --only_testing TestAppTests/TestAppTests/testLiteInterpreter @@ -1241,4 +1432,93 @@ workflows: branches: only: - postnightly + - mac_build: + name: 
macos-12-py3-x86-64-build + build-environment: macos-12-py3-x86-64 + xcode-version: "13.3.1" + - mac_test: + name: macos-12-py3-x86-64-test-1-2-default + build-environment: macos-12-py3-x86-64 + xcode-version: "13.3.1" + shard-number: "1" + num-test-shards: "2" + requires: + - macos-12-py3-x86-64-build + - mac_test: + name: macos-12-py3-x86-64-test-2-2-default + build-environment: macos-12-py3-x86-64 + xcode-version: "13.3.1" + shard-number: "2" + num-test-shards: "2" + requires: + - macos-12-py3-x86-64-build + - mac_test: + name: macos-12-py3-x86-64-test-1-1-functorch + build-environment: macos-12-py3-x86-64 + xcode-version: "13.3.1" + shard-number: "1" + num-test-shards: "1" + test-config: functorch + requires: + - macos-12-py3-x86-64-build + - mac_build: + name: macos-12-py3-x86-64-lite-interpreter-build-test + build-environment: macos-12-py3-lite-interpreter-x86-64 + xcode-version: "13.3.1" + build-generates-artifacts: false + - mac_build: + name: macos-12-py3-arm64-build + build-environment: macos-12-py3-arm64 + xcode-version: "13.3.1" + python-version: "3.9.12" + - upload_test_stats: + name: upload test status + requires: + - macos-12-py3-x86-64-test-1-2-default + - macos-12-py3-x86-64-test-2-2-default + - macos-12-py3-x86-64-test-1-1-functorch + - pytorch_ios_build: + build_environment: ios-12-5-1-x86-64 + ios_arch: x86_64 + ios_platform: SIMULATOR + lite_interpreter: "1" + name: ios-12-5-1-x86-64 + - pytorch_ios_build: + build_environment: ios-12-5-1-arm64 + context: org-member + ios_arch: arm64 + ios_platform: OS + lite_interpreter: "1" + name: ios-12-5-1-arm64 + - pytorch_ios_build: + build_environment: ios-12-5-1-arm64-metal + context: org-member + ios_arch: arm64 + ios_platform: OS + lite_interpreter: "1" + name: ios-12-5-1-arm64-metal + use_metal: "1" + - pytorch_ios_build: + build_environment: ios-12-5-1-arm64-custom-ops + context: org-member + ios_arch: arm64 + ios_platform: OS + lite_interpreter: "1" + name: ios-12-5-1-arm64-custom-ops + op_list: mobilenetv2.yaml + - pytorch_ios_build: + build_environment: ios-12-5-1-x86-64-coreml + ios_arch: x86_64 + ios_platform: SIMULATOR + lite_interpreter: "1" + name: ios-12-5-1-x86-64-coreml + use_coreml: "1" + - pytorch_ios_build: + build_environment: ios-12-5-1-arm64-coreml + context: org-member + ios_arch: arm64 + ios_platform: OS + lite_interpreter: "1" + name: ios-12-5-1-arm64-coreml + use_coreml: "1" when: << pipeline.parameters.run_build >> diff --git a/.circleci/docker/build.sh b/.circleci/docker/build.sh index ee785bbc95039..b7fef829b798e 100755 --- a/.circleci/docker/build.sh +++ b/.circleci/docker/build.sh @@ -84,6 +84,8 @@ if [[ "$image" == *xenial* ]] || [[ "$image" == *bionic* ]]; then fi TRAVIS_DL_URL_PREFIX="https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/14.04/x86_64" +_UCX_COMMIT=31e74cac7bee0ef66bef2af72e7d86d9c282e5ab +_UCC_COMMIT=12944da33f911daf505d9bbc51411233d0ed85e1 # It's annoying to rename jobs every time you want to rewrite a # configuration, so we hardcode everything here rather than do it @@ -147,6 +149,8 @@ case "$image" in DB=yes VISION=yes KATEX=yes + UCX_COMMIT=${_UCX_COMMIT} + UCC_COMMIT=${_UCC_COMMIT} ;; pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7) CUDA_VERSION=11.7.0 @@ -157,6 +161,8 @@ case "$image" in DB=yes VISION=yes KATEX=yes + UCX_COMMIT=${_UCX_COMMIT} + UCC_COMMIT=${_UCC_COMMIT} ;; pytorch-linux-xenial-py3-clang5-asan) ANACONDA_PYTHON_VERSION=3.7 @@ -262,7 +268,7 @@ case "$image" in ;; pytorch-linux-focal-py3.7-gcc7) ANACONDA_PYTHON_VERSION=3.7 - CMAKE_VERSION=3.12.4 # To 
make sure XNNPACK is enabled for the BACKWARDS_COMPAT_TEST used with this image + CMAKE_VERSION=3.16.9 # Required for precompiled header support GCC_VERSION=7 PROTOBUF=yes DB=yes @@ -375,6 +381,8 @@ docker build \ --build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \ --build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx900;gfx906}" \ --build-arg "IMAGE_NAME=${IMAGE_NAME}" \ + --build-arg "UCX_COMMIT=${UCX_COMMIT}" \ + --build-arg "UCC_COMMIT=${UCC_COMMIT}" \ -f $(dirname ${DOCKERFILE})/Dockerfile \ -t "$tmp_tag" \ "$@" \ diff --git a/.circleci/docker/common/install_base.sh b/.circleci/docker/common/install_base.sh index 26ca9d79cedeb..6724031c0a447 100755 --- a/.circleci/docker/common/install_base.sh +++ b/.circleci/docker/common/install_base.sh @@ -67,7 +67,8 @@ install_ubuntu() { wget \ sudo \ vim \ - jq + jq \ + libtool # Should resolve issues related to various apt package repository cert issues # see: https://github.com/pytorch/pytorch/issues/65931 diff --git a/.circleci/docker/common/install_conda.sh b/.circleci/docker/common/install_conda.sh index 49afcb5aef423..3626d0cc33d4c 100755 --- a/.circleci/docker/common/install_conda.sh +++ b/.circleci/docker/common/install_conda.sh @@ -55,8 +55,10 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then # Ensure we run conda in a directory that jenkins has write access to pushd /opt/conda - # Track latest conda update - as_jenkins conda update -y -n base conda + # Prevent conda from updating to 4.14.0, which causes docker build failures + # See https://hud.pytorch.org/pytorch/pytorch/commit/754d7f05b6841e555cea5a4b2c505dd9e0baec1d + # Uncomment the below when resolved to track the latest conda update + # as_jenkins conda update -y -n base conda # Install correct Python version as_jenkins conda install -y python="$ANACONDA_PYTHON_VERSION" diff --git a/.circleci/docker/common/install_ucc.sh b/.circleci/docker/common/install_ucc.sh new file mode 100755 index 0000000000000..4d691ebb5e9ed --- /dev/null +++ b/.circleci/docker/common/install_ucc.sh @@ -0,0 +1,48 @@ +#!/bin/bash + +set -ex + +if [[ -d "/usr/local/cuda/" ]]; then + with_cuda=/usr/local/cuda/ +else + with_cuda=no +fi + +function install_ucx() { + set -ex + git clone --recursive https://github.com/openucx/ucx.git + pushd ucx + git checkout ${UCX_COMMIT} + git submodule update --init --recursive + + ./autogen.sh + ./configure --prefix=$UCX_HOME \ + --enable-mt \ + --with-cuda=$with_cuda \ + --enable-profiling \ + --enable-stats + time make -j + sudo make install + + popd + rm -rf ucx +} + +function install_ucc() { + set -ex + git clone --recursive https://github.com/openucx/ucc.git + pushd ucc + git checkout ${UCC_COMMIT} + git submodule update --init --recursive + + ./autogen.sh + ./configure --prefix=$UCC_HOME --with-ucx=$UCX_HOME --with-nccl=no --with-cuda=$with_cuda + time make -j + sudo make install + + popd + rm -rf ucc +} + +install_ucx +install_ucc diff --git a/.circleci/docker/requirements-ci.txt b/.circleci/docker/requirements-ci.txt index 451bd39467c37..5662eadc4f661 100644 --- a/.circleci/docker/requirements-ci.txt +++ b/.circleci/docker/requirements-ci.txt @@ -164,6 +164,16 @@ pytest-rerunfailures #Pinned versions: #test that import: +xdoctest==1.0.2 +#Description: runs doctests in pytest +#Pinned versions: 1.0.2 +#test that import: + +pygments==2.12.0 +#Description: support doctest highlighting +#Pinned versions: 2.12.0 +#test that import: the doctests + #PyYAML #Description: data serialization format #Pinned versions: diff --git a/.circleci/docker/ubuntu-cuda/Dockerfile 
b/.circleci/docker/ubuntu-cuda/Dockerfile index f7674987a0c3e..a3a623996ad02 100644 --- a/.circleci/docker/ubuntu-cuda/Dockerfile +++ b/.circleci/docker/ubuntu-cuda/Dockerfile @@ -62,6 +62,17 @@ RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi RUN rm install_vision.sh ENV INSTALLED_VISION ${VISION} +# (optional) Install UCC +ARG UCX_COMMIT +ARG UCC_COMMIT +ENV UCX_COMMIT $UCX_COMMIT +ENV UCC_COMMIT $UCC_COMMIT +ENV UCX_HOME /usr +ENV UCC_HOME /usr +ADD ./common/install_ucc.sh install_ucc.sh +RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi +RUN rm install_ucc.sh + COPY ./common/install_openssl.sh install_openssl.sh ENV OPENSSL_ROOT_DIR /opt/openssl RUN bash ./install_openssl.sh diff --git a/.circleci/docker/ubuntu/Dockerfile b/.circleci/docker/ubuntu/Dockerfile index 22592534c20f0..e86baf0d6690e 100644 --- a/.circleci/docker/ubuntu/Dockerfile +++ b/.circleci/docker/ubuntu/Dockerfile @@ -58,6 +58,17 @@ RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh ENV DESIRED_CUDA ${CUDA_VERSION} ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH +# (optional) Install UCC +ARG UCX_COMMIT +ARG UCC_COMMIT +ENV UCX_COMMIT $UCX_COMMIT +ENV UCC_COMMIT $UCC_COMMIT +ENV UCX_HOME /usr +ENV UCC_HOME /usr +ADD ./common/install_ucc.sh install_ucc.sh +RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi +RUN rm install_ucc.sh + # (optional) Install protobuf for ONNX ARG PROTOBUF COPY ./common/install_protobuf.sh install_protobuf.sh diff --git a/.circleci/generate_config_yml.py b/.circleci/generate_config_yml.py index e068dd98fd8ea..c2e90a4b824fd 100755 --- a/.circleci/generate_config_yml.py +++ b/.circleci/generate_config_yml.py @@ -14,6 +14,9 @@ import cimodel.data.simple.mobile_definitions import cimodel.data.simple.nightly_ios import cimodel.data.simple.anaconda_prune_defintions +import cimodel.data.simple.macos_definitions +import cimodel.data.simple.ios_definitions +import cimodel.data.simple.upload_test_stats_definition import cimodel.lib.miniutils as miniutils import cimodel.lib.miniyaml as miniyaml @@ -70,6 +73,7 @@ def write(self, output_filehandle): for line in filter(None, lines): output_filehandle.write(line + "\n") + def _for_all_items(items, functor) -> None: if isinstance(items, list): for item in items: @@ -78,6 +82,7 @@ def _for_all_items(items, functor) -> None: item_type, item = next(iter(items.items())) functor(item_type, item) + def filter_master_only_jobs(items): def _is_main_or_master_item(item): filters = item.get('filters', None) @@ -116,6 +121,7 @@ def _do_filtering(items): _for_all_items(items, _save_requires_if_master) return _do_filtering(items) + def generate_required_docker_images(items): required_docker_images = set() @@ -131,11 +137,15 @@ def _requires_docker_image(item_type, item): _for_all_items(items, _requires_docker_image) return required_docker_images + def gen_build_workflows_tree(): build_workflows_functions = [ cimodel.data.simple.mobile_definitions.get_workflow_jobs, cimodel.data.simple.nightly_ios.get_workflow_jobs, cimodel.data.simple.anaconda_prune_defintions.get_workflow_jobs, + cimodel.data.simple.macos_definitions.get_new_workflow_jobs, + cimodel.data.simple.upload_test_stats_definition.get_workflow_job, + cimodel.data.simple.ios_definitions.get_workflow_jobs, ] build_jobs = [f() for f in build_workflows_functions] build_jobs.extend( diff --git a/.circleci/verbatim-sources/job-specs/job-specs-custom.yml b/.circleci/verbatim-sources/job-specs/job-specs-custom.yml 
index 180ea014db6d3..bb6155fb7ab50 100644 --- a/.circleci/verbatim-sources/job-specs/job-specs-custom.yml +++ b/.circleci/verbatim-sources/job-specs/job-specs-custom.yml @@ -95,6 +95,196 @@ paths: - miniconda3 + mac_build: + parameters: + build-environment: + type: string + description: Top-level label for what's being built/tested. + xcode-version: + type: string + default: "13.3.1" + description: What xcode version to build with. + build-generates-artifacts: + type: boolean + default: true + description: if the build generates build artifacts + python-version: + type: string + default: "3.8" + macos: + xcode: << parameters.xcode-version >> + resource_class: medium + environment: + BUILD_ENVIRONMENT: << parameters.build-environment >> + AWS_REGION: us-east-1 + steps: + + - checkout + - run_brew_for_macos_build + + - run: + name: Install sccache + command: | + sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${BASH_ENV}" + echo "export SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${BASH_ENV}" + + set +x + echo "export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}" + echo "export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}" + set -x + + - run: + name: Get workflow job id + command: | + echo "export OUR_GITHUB_JOB_ID=${CIRCLE_WORKFLOW_JOB_ID}" >> "${BASH_ENV}" + + - run: + name: Build + command: | + set -x + + git submodule sync + git submodule update --init --recursive --depth 1 --jobs 0 + + export PATH="/usr/local/bin:$PATH" + export WORKSPACE_DIR="${HOME}/workspace" + mkdir -p "${WORKSPACE_DIR}" + MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-MacOSX-x86_64.sh" + if [ << parameters.python-version >> == 3.9.12 ]; then + MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh" + fi + + # If a local installation of conda doesn't exist, we download and install conda + if [ ! 
-d "${WORKSPACE_DIR}/miniconda3" ]; then + mkdir -p "${WORKSPACE_DIR}" + curl --retry 3 ${MINICONDA_URL} -o "${WORKSPACE_DIR}"/miniconda3.sh + bash "${WORKSPACE_DIR}"/miniconda3.sh -b -p "${WORKSPACE_DIR}"/miniconda3 + fi + export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH" + # shellcheck disable=SC1091 + source "${WORKSPACE_DIR}"/miniconda3/bin/activate + + echo "export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${BASH_ENV}" + .jenkins/pytorch/macos-build.sh + + - when: + condition: << parameters.build-generates-artifacts >> + steps: + - run: + name: Archive artifacts into zip + command: | + zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json + cp artifacts.zip /Users/distiller/workspace + + - persist_to_workspace: + root: /Users/distiller/workspace/ + paths: + - miniconda3 + - artifacts.zip + + - store_artifacts: + path: /Users/distiller/project/artifacts.zip + + mac_test: + parameters: + build-environment: + type: string + shard-number: + type: string + num-test-shards: + type: string + xcode-version: + type: string + test-config: + type: string + default: 'default' + + macos: + xcode: << parameters.xcode-version >> + environment: + GIT_DEFAULT_BRANCH: 'master' + BUILD_ENVIRONMENT: << parameters.build-environment >> + TEST_CONFIG: << parameters.test-config >> + SHARD_NUMBER: << parameters.shard-number >> + NUM_TEST_SHARDS: << parameters.num-test-shards >> + PYTORCH_RETRY_TEST_CASES: 1 + PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1 + steps: + - checkout + - attach_workspace: + at: ~/workspace + - run_brew_for_macos_build + - run: + name: Test + no_output_timeout: "1h" + command: | + set -x + + git submodule sync --recursive + git submodule update --init --recursive + + mv ~/workspace/artifacts.zip . + unzip artifacts.zip + + export IN_CI=1 + + COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}") + + export PATH="/usr/local/bin:$PATH" + export WORKSPACE_DIR="${HOME}/workspace" + mkdir -p "${WORKSPACE_DIR}" + + export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH" + source "${WORKSPACE_DIR}"/miniconda3/bin/activate + + # sanitize the input commit message and PR body here: + + # trim all new lines from commit messages to avoid issues with batch environment + # variable copying. 
see https://github.com/pytorch/pytorch/pull/80043#issuecomment-1167796028 + COMMIT_MESSAGES="${COMMIT_MESSAGES//[$'\n\r']}" + + # then trim all special characters like single and double quotes to avoid unescaped inputs to + # wreak havoc internally + export COMMIT_MESSAGES="${COMMIT_MESSAGES//[\'\"]}" + + python3 -mpip install dist/*.whl + .jenkins/pytorch/macos-test.sh + - run: + name: Copy files for uploading test stats + command: | + # copy into a parent folder test-reports because we can't use CIRCLEI_BUILD_NUM in path when persisting to workspace + mkdir -p test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports + cp -r test/test-reports test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports + - store_test_results: + path: test/test-reports + - persist_to_workspace: + root: /Users/distiller/project/ + paths: + - test-reports + + upload_test_stats: + machine: # executor type + image: ubuntu-2004:202010-01 # # recommended linux image - includes Ubuntu 20.04, docker 19.03.13, docker-compose 1.27.4 + steps: + - checkout + - attach_workspace: + at: ~/workspace + - run: + name: upload + command: | + set -ex + if [ -z ${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} ]; then + echo "No credentials found, cannot upload test stats (are you on a fork?)" + exit 0 + fi + cp -r ~/workspace/test-reports/* ~/project + pip3 install requests==2.26 rockset==0.8.3 boto3==1.19.12 six==1.16.0 + export AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} + export AWS_SECRET_ACCESS_KEY=${AWS_SECRET_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} + # i dont know how to get the run attempt number for reruns so default to 1 + python3 -m tools.stats.upload_test_stats --workflow-run-id "${CIRCLE_WORKFLOW_JOB_ID}" --workflow-run-attempt 1 --head-branch << pipeline.git.branch >> --circleci pytorch_macos_10_13_py3_test: environment: BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test @@ -436,12 +626,13 @@ cd ${PROJ_ROOT}/ios/TestApp/benchmark mkdir -p ../models if [ ${USE_COREML_DELEGATE} == 1 ]; then - pip install coremltools==5.0b5 - pip install six + pip install coremltools==5.0b5 protobuf==3.20.1 six==1.16.0 python coreml_backend.py else - python trace_model.py + cd "${PROJ_ROOT}" + python test/mobile/model_test/gen_test_model.py ios-test fi + cd "${PROJ_ROOT}/ios/TestApp/benchmark" if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then echo "Setting up the TestApp for LiteInterpreter" ruby setup.rb --lite 1 @@ -449,10 +640,10 @@ echo "Setting up the TestApp for Full JIT" ruby setup.rb fi - cd ${PROJ_ROOT}/ios/TestApp - instruments -s -devices - if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then - if [ ${USE_COREML_DELEGATE} == 1 ]; then + cd "${PROJ_ROOT}/ios/TestApp" + # instruments -s -devices + if [ "${BUILD_LITE_INTERPRETER}" == 1 ]; then + if [ "${USE_COREML_DELEGATE}" == 1 ]; then fastlane scan --only_testing TestAppTests/TestAppTests/testCoreML else fastlane scan --only_testing TestAppTests/TestAppTests/testLiteInterpreter diff --git a/.github/ISSUE_TEMPLATE/ci-sev.md b/.github/ISSUE_TEMPLATE/ci-sev.md index 8178c68d978b7..2b6bbfc982c95 100644 --- a/.github/ISSUE_TEMPLATE/ci-sev.md +++ b/.github/ISSUE_TEMPLATE/ci-sev.md @@ -5,6 +5,8 @@ about: Tracking incidents for PyTorch's CI infra. > NOTE: Remember to label this issue with "`ci: sev`" +**MERGE BLOCKING** + ## Current Status *Status could be: preemptive, ongoing, mitigated, closed. Also tell people if they need to take action to fix it (i.e. rebase)*. 
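
A note on the CI renaming above: the dash-separated job names (for example `ios-12-5-1-x86-64` and the new `macos-12-py3-x86-64-*` jobs) all hinge on `MultiPartVersion.render_dots_or_parts`, which now takes an optional separator string instead of a `with_dots` boolean. Below is a minimal, illustrative sketch of the intended behaviour, not part of the patch itself; the body of `prefixed_parts` is an assumption (it lies outside the hunk shown), while `render_dots_or_parts` follows the diff, and the `[12, 5, 1]` value is simply the Xcode version the job names suggest.

```python
# Illustrative sketch only, not part of the patch.
from typing import List, Optional


class MultiPartVersion:
    def __init__(self, parts: List[int], prefix: str = "") -> None:
        self.parts = parts
        self.prefix = prefix

    def prefixed_parts(self) -> List[str]:
        # Assumed body (outside the hunk): the prefix is glued onto the first
        # component, e.g. ("cuda", [10, 2]) -> ["cuda10", "2"]; with an empty
        # prefix this is just the stringified parts.
        if self.parts:
            return [self.prefix + str(self.parts[0])] + [str(p) for p in self.parts[1:]]
        return [self.prefix]

    def render_dots_or_parts(self, sep: Optional[str] = None) -> List[str]:
        # sep=None keeps the old list-of-parts behaviour; any separator
        # collapses the parts into a single token.
        if sep is None:
            return self.prefixed_parts()
        return [sep.join(self.prefixed_parts())]


xcode = MultiPartVersion([12, 5, 1])
print(xcode.render_dots_or_parts())     # ['12', '5', '1']
print(xcode.render_dots_or_parts("-"))  # ['12-5-1'], joined into job names like ios-12-5-1-x86-64
print(xcode.render_dots_or_parts("."))  # ['12.5.1'], used for dotted build_environment strings
```

Passing `"-"` yields the dash-separated job names used throughout the new workflow list, `"."` keeps dotted versions for `build_environment` strings, and `None` preserves the old list-of-parts behaviour for callers such as `gen_job_name` in nightly_ios.py that join the pieces themselves.
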
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index fc203f1e0d6ce..7d428014cd79c 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -1,8 +1 @@ -### Description - - -### Issue - - -### Testing - +Fixes #ISSUE_NUMBER diff --git a/.github/actions/get-workflow-job-id/action.yml b/.github/actions/get-workflow-job-id/action.yml index 34863677407af..4dc6ba90c3961 100644 --- a/.github/actions/get-workflow-job-id/action.yml +++ b/.github/actions/get-workflow-job-id/action.yml @@ -15,7 +15,7 @@ outputs: runs: using: composite steps: - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + - uses: nick-fields/retry@7d4a37704547a311dbb66ebdf5b23ec19374a767 id: get-job-id env: GITHUB_TOKEN: ${{ inputs.github-token }} diff --git a/.github/actions/setup-win/action.yml b/.github/actions/setup-win/action.yml index 12f287b230898..c5f1cac550f68 100644 --- a/.github/actions/setup-win/action.yml +++ b/.github/actions/setup-win/action.yml @@ -58,3 +58,8 @@ runs: uses: actions/setup-python@v2 with: python-version: "3.x" + cache: pip + cache-dependency-path: | + **/requirements.txt + **/.circleci/docker/requirements-ci.txt + **/.github/requirements-gha-cache.txt diff --git a/.github/ci_commit_pins/torchdynamo.txt b/.github/ci_commit_pins/torchdynamo.txt index 4bca66289a606..a0bcc4dd6c4bc 100644 --- a/.github/ci_commit_pins/torchdynamo.txt +++ b/.github/ci_commit_pins/torchdynamo.txt @@ -1 +1 @@ -a43631c54014b2e68a09b39658cbf515875394f6 +058b3581bde241ed72b4092d92e561dd9d82fff0 diff --git a/.github/ci_commit_pins/vision.txt b/.github/ci_commit_pins/vision.txt index 7406c775afed2..34c3bd4731e26 100644 --- a/.github/ci_commit_pins/vision.txt +++ b/.github/ci_commit_pins/vision.txt @@ -1 +1 @@ -1a1d509c8e6578584e7e9e4bd442654bf39149c8 +9c3e2bf46bc49997679785d76b7d0a9fea0223c7 diff --git a/.github/ci_commit_pins/xla.txt b/.github/ci_commit_pins/xla.txt index 6f0f5eab8182e..170afa2afb3c5 100644 --- a/.github/ci_commit_pins/xla.txt +++ b/.github/ci_commit_pins/xla.txt @@ -1 +1 @@ -73c64a55fb096f1e132029d3decbb6f4e532cc7b +9b2f7929c2dae841888a836449c25b04c8cf4045 diff --git a/.github/generated-ciflow-ruleset.json b/.github/generated-ciflow-ruleset.json deleted file mode 100644 index 7605e17918849..0000000000000 --- a/.github/generated-ciflow-ruleset.json +++ /dev/null @@ -1,5 +0,0 @@ -{ - "__comment": "@generated DO NOT EDIT MANUALLY, Generation script: .github/scripts/generate_ci_workflows.py", - "label_rules": {}, - "version": "v1" -} diff --git a/.github/merge_rules.json b/.github/merge_rules.json deleted file mode 100644 index 704e1a5d96509..0000000000000 --- a/.github/merge_rules.json +++ /dev/null @@ -1,230 +0,0 @@ -[ - { - "name": "ONNX exporter", - "patterns": [ - ".jenkins/caffe2/*", - "aten/src/ATen/core/interned_strings.h", - "docs/source/onnx.rst", - "docs/source/scripts/onnx/**", - "scripts/onnx/**", - "test/jit/test_export_modes.py", - "test/onnx/**", - "tools/onnx/**", - "torch/_C/__init__.pyi.in", - "torch/csrc/jit/passes/onnx.*", - "torch/csrc/jit/passes/onnx/**", - "torch/csrc/jit/serialization/export.*", - "torch/csrc/jit/serialization/onnx.*", - "torch/csrc/onnx/**", - "torch/onnx/**" - ], - "approved_by": ["BowenBao", "garymm"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "NVFuser", - "patterns": [ - "test/test_jit_cuda_fuser.py", - "torch/csrc/jit/codegen/fuser/cuda/**", - "torch/csrc/jit/codegen/cuda/**", - "benchmarks/cpp/nvfuser/**" - ], - "approved_by": 
["csarofeen", "ngimel", "jjsjann123"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "OSS CI", - "patterns": [".github/**", ".circleci/**", ".jenkins/**", "scripts/**", "tools/**"], - "approved_by": ["ezyang", "pytorch/pytorch-dev-infra"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "CI Pinned Hashes", - "patterns": [ - ".github/ci_commit_pins/vision.txt", - ".github/ci_commit_pins/torchdynamo.txt" - ], - "approved_by": ["pytorchbot", "ezyang", "pytorch/pytorch-dev-infra"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "XLA hash pin update", - "patterns": [".github/ci_commit_pins/xla.txt"], - "approved_by": ["pytorchbot", "ezyang", "pytorch/pytorch-dev-infra"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull / linux-bionic-py3_7-clang8-xla / build", - "pull / linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge)" - ] - }, - { - "name": "Documentation", - "patterns": ["docs/**", "torch/*docs.py"], - "approved_by": ["mruberry", "ngimel", "janeyx99", "svekars"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Mobile", - "patterns": ["ios/**", "android/**", "test/mobile/**"], - "approved_by": ["linbinyu", "kit1980", "IvanKobzarev", "dreiss"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Linear Algebra", - "patterns": [ - "aten/src/ATen/native/cuda/linalg/**", - "aten/src/ATen/LinalgBackend.h", - "aten/src/ATen/native/**LinearAlgebra*", - "docs/source/linalg.rst", - "torch/linalg/**", - "torch/_linalg_utils.py", - "torch/**python_linalg_functions.*", - "torch/**linalg.h", - "tools/autograd/templates/python_linalg_functions.cpp", - "test/test_linalg.py" - ], - "approved_by": ["nikitaved", "mruberry", "pearu", "Lezcano", "IvanYashchuk"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "FFT", - "patterns": [ - "aten/src/ATen/native/cuda/*FFT*.h", - "aten/src/ATen/native/SpectralOps.cpp", - "aten/src/ATen/native/mkl/SpectralOps.cpp", - "aten/src/ATen/native/cuda/SpectralOps.*", - "docs/source/fft.rst", - "torch/fft/**", - "torch/csrc/api/include/torch/fft.h", - "torch/**python_fft_functions.*", - "tools/autograd/templates/python_fft_functions.cpp", - "test/cpp/api/fft.cpp" - ], - "approved_by": ["mruberry", "peterbell10"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Sparse", - "patterns": [ - "benchmarks/sparse", - "c10/util/sparse_bitset.h", - "docs/source/sparse.rst", - "torch/**sparse/**", - "torch/**sparse*", - "torch/optim/sparse*", - "torch/ao/nn/sparse/**", - "torch/utils/benchmark/**sparse*", - "aten/src/ATen/native/ao_sparse/**", - "aten/src/ATen/native/sparse/**", - "aten/src/ATen/**Sparse*", - "aten/src/ATen/*Sparse*", - "torch/_masked/**", - "test/*_masked*", - "test/**sparse*" - ], - "approved_by": ["nikitaved", "cpuhrsch", "pearu", "IvanYashchuk"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "MPS", - "patterns": [ - "test/test_mps.py", - "aten/src/ATen/native/native_functions.yaml", - "aten/src/ATen/mps/**", - "aten/src/ATen/native/mps/**" - ], - "approved_by": ["kulinseth", "razarmehr"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Distributions", - "patterns": [ - "torch/distributions/**", - 
"test/distributions/**" - ], - "approved_by": ["fritzo", "neerajprad", "alicanb", "vishwakftw"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Distributed", - "patterns": [ - "docs/source/pipeline.rst", - "docs/source/distributed*", - "docs/source/rpc.rst", - "docs/source/rpc/**", - "docs/source/_static/img/rpc*", - "docs/source/_static/img/*distributed*", - "docs/source/elastic/**", - "benchmarks/distributed/**", - "torch/distributed/**", - "torch/nn/parallel/distributed*", - "torch/_C/_distributed*", - "torch/csrc/distributed/**", - "torch/testing/_internal/distributed/**", - "test/distributed/**", - "test/cpp/dist_autograd/**", - "test/cpp/rpc/**" - ], - "approved_by": ["mrshenli", "pritamdamania87", "d4l3k", "kiukchung", "pietern"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "superuser", - "patterns": ["*"], - "approved_by": ["pytorch/metamates"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - } -] diff --git a/.github/merge_rules.yaml b/.github/merge_rules.yaml new file mode 100644 index 0000000000000..5557926cc2116 --- /dev/null +++ b/.github/merge_rules.yaml @@ -0,0 +1,342 @@ +- name: Core Maintainers + patterns: + - '*' + approved_by: + - soumith + - gchanan + - ezyang + - dzhulgakov + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: ONNX exporter + patterns: + - .jenkins/caffe2/* + - aten/src/ATen/core/interned_strings.h + - docs/source/onnx.rst + - docs/source/scripts/onnx/** + - scripts/onnx/** + - test/jit/test_export_modes.py + - test/onnx/** + - tools/onnx/** + - torch/_C/__init__.pyi.in + - torch/csrc/jit/passes/onnx.* + - torch/csrc/jit/passes/onnx/** + - torch/csrc/jit/serialization/export.* + - torch/csrc/jit/serialization/onnx.* + - torch/csrc/onnx/** + - torch/onnx/** + approved_by: + - BowenBao + - abock + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: NVFuser + patterns: + - test/test_jit_cuda_fuser.py + - torch/csrc/jit/codegen/fuser/cuda/** + - torch/csrc/jit/codegen/cuda/** + - benchmarks/cpp/nvfuser/** + approved_by: + - csarofeen + - ngimel + - jjsjann123 + - ptrblck + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: OSS CI + patterns: + - .github/** + - .circleci/** + - .jenkins/** + - scripts/** + - tools/** + approved_by: + - alband + - dagitses + - pytorch/pytorch-dev-infra + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: OSS CI / pytorchbot + patterns: + - .github/ci_commit_pins/vision.txt + - .github/ci_commit_pins/torchdynamo.txt + approved_by: + - pytorchbot + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: OSS CI / pytorchbot / XLA + patterns: + - .github/ci_commit_pins/xla.txt + approved_by: + - pytorchbot + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull / linux-bionic-py3_7-clang8-xla / build + - pull / linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge) + +- name: Documentation + patterns: + - docs/** + - torch/*docs.py + approved_by: + - svekars + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: Mobile + patterns: + - ios/** + - android/** + - test/mobile/** + approved_by: + - linbinyu + - IvanKobzarev + - dreiss + - raziel + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: Linear Algebra + patterns: + - aten/src/ATen/native/cuda/linalg/** + - aten/src/ATen/LinalgBackend.h + - 
aten/src/ATen/native/**LinearAlgebra* + - docs/source/linalg.rst + - torch/linalg/** + - torch/_linalg_utils.py + - torch/**python_linalg_functions.* + - torch/**linalg.h + - tools/autograd/templates/python_linalg_functions.cpp + - test/test_linalg.py + approved_by: + - mruberry + - Lezcano + - IvanYashchuk + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: FFT + patterns: + - aten/src/ATen/native/cuda/*FFT*.h + - aten/src/ATen/native/SpectralOps.cpp + - aten/src/ATen/native/mkl/SpectralOps.cpp + - aten/src/ATen/native/cuda/SpectralOps.* + - docs/source/fft.rst + - torch/fft/** + - torch/csrc/api/include/torch/fft.h + - torch/**python_fft_functions.* + - tools/autograd/templates/python_fft_functions.cpp + - test/cpp/api/fft.cpp + approved_by: + - mruberry + - peterbell10 + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: Sparse + patterns: + - benchmarks/sparse + - c10/util/sparse_bitset.h + - docs/source/sparse.rst + - torch/**sparse/** + - torch/**sparse* + - torch/optim/sparse* + - torch/ao/nn/sparse/** + - torch/utils/benchmark/**sparse* + - aten/src/ATen/native/ao_sparse/** + - aten/src/ATen/native/sparse/** + - aten/src/ATen/**Sparse* + - aten/src/ATen/*Sparse* + - torch/_masked/** + - test/*_masked* + - test/**sparse* + approved_by: + - nikitaved + - cpuhrsch + - pearu + - IvanYashchuk + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: MPS + patterns: + - test/test_mps.py + - aten/src/ATen/native/native_functions.yaml + - aten/src/ATen/mps/** + - aten/src/ATen/native/mps/** + approved_by: + - kulinseth + - alband + - malfet + - razarmehr + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull +- name: Distributions + patterns: + - torch/distributions/** + - test/distributions/** + approved_by: + - fritzo + - neerajprad + - alicanb + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: Distributed + patterns: + - docs/source/pipeline.rst + - docs/source/distributed* + - docs/source/rpc.rst + - docs/source/rpc/** + - docs/source/_static/img/rpc* + - docs/source/_static/img/*distributed* + - docs/source/elastic/** + - benchmarks/distributed/** + - torch/distributed/** + - torch/nn/parallel/distributed* + - torch/_C/_distributed* + - torch/csrc/distributed/** + - torch/testing/_internal/distributed/** + - test/distributed/** + - test/cpp/dist_autograd/** + - test/cpp/rpc/** + approved_by: + - mrshenli + - pritamdamania87 + - zhaojuanmao + - rohan-varma + - wanchaol + - fduwjj + - H-Huang + - d4l3k + - aazzolini + - kwen2501 + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: IDEEP + patterns: + - third_party/ideep + - caffe2/ideep/** + - caffe2/python/ideep/** + approved_by: + - XiaobingSuper + - yanbing-j + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: oneDNN graph + patterns: + - torch/csrc/jit/codegen/onednn/** + - test/test_jit_llga_fuser.py + approved_by: + - sanchitintel + - chunyuan-w + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: CPU ATen backend + patterns: + - aten/src/ATen/cpu/** + - aten/src/ATen/native/cpu/** + - aten/src/ATen/native/quantized/cpu/** + - aten/src/ATen/native/Convolution*.cpp + - aten/src/ATen/native/mkldnn/** + approved_by: + - mingfeima + - XiaobingSuper + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: CPU frontend + patterns: + - torch/cpu/** + - torch/utils/mkldnn.py + - test/test_mkldnn.py + approved_by: + - leslie-fang-intel + - CaoE 
+ mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: Autocast + patterns: + - torch/amp/** + - aten/src/ATen/autocast_mode.* + - torch/csrc/jit/passes/autocast.cpp + - test/test_autocast.py + approved_by: + - leslie-fang-intel + - CaoE + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: Lazy Tensor + patterns: + - torch/csrc/lazy/** + - test/cpp/lazy/** + - test/lazy/** + - codegen/api/lazy.py + - codegen/dest/lazy_ir.py + - codegen/dest/lazy_ts_lowering.py + - codegen/gen_lazy_tensor.py + - aten/src/ATen/native/ts_native_functions.yaml + approved_by: + - alanwaketan + - JackCaoG + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull + +- name: superuser + patterns: + - '*' + approved_by: + - pytorch/metamates + mandatory_checks_name: + - Facebook CLA Check + - Lint + - pull diff --git a/.github/requirements-gha-cache.txt b/.github/requirements-gha-cache.txt new file mode 100644 index 0000000000000..4e757f9307fff --- /dev/null +++ b/.github/requirements-gha-cache.txt @@ -0,0 +1,16 @@ +# This file is to cache other dependencies not specified elsewhere in: +# requirements.txt +# requirements-flake8.txt +# docs/requirements.txt +# docs/cpp/requirements.txt +# functorch/docs/requirements.txt +# .circleci/docker/requirements-ci.txt +cffi==1.15.0 +dataclasses==0.6 +jinja2==3.0.1 +lintrunner==0.9.2 +ninja==1.10.0.post1 +pynvml==11.4.1 +requests==2.26 +rich==10.9.0 +rockset==0.8.10 diff --git a/.github/scale-config.yml b/.github/scale-config.yml index 931ca0ef5f1e2..1cf99b326ba81 100644 --- a/.github/scale-config.yml +++ b/.github/scale-config.yml @@ -65,5 +65,5 @@ runner_types: windows.8xlarge.nvidia.gpu: instance_type: p3.2xlarge os: windows - max_available: 50 + max_available: 100 disk_size: 256 diff --git a/.github/scripts/comment_on_pr.py b/.github/scripts/comment_on_pr.py new file mode 100644 index 0000000000000..06b2eefe09884 --- /dev/null +++ b/.github/scripts/comment_on_pr.py @@ -0,0 +1,34 @@ +from typing import Any +from trymerge import gh_post_pr_comment +from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo +from trymerge_explainer import BOT_COMMANDS_WIKI +import os + + +def parse_args() -> Any: + from argparse import ArgumentParser + + parser = ArgumentParser("Comment on a PR") + parser.add_argument("pr_num", type=int) + parser.add_argument("action", type=str) + return parser.parse_args() + + +def main() -> None: + args = parse_args() + repo = GitRepo(get_git_repo_dir(), get_git_remote_name(), debug=True) + org, project = repo.gh_owner_and_name() + run_url = os.environ.get("GH_RUN_URL") + + job_link = f"[job]({run_url})" if run_url is not None else "job" + msg = ( + f"The {args.action} {job_link} was canceled. If you believe this is a mistake, " + + f"then you can re-trigger it through [pytorch-bot]({BOT_COMMANDS_WIKI})."
+ ) + + gh_post_pr_comment(org, project, args.pr_num, msg) + print(org, project, args.pr_num, msg) + + +if __name__ == "__main__": + main() diff --git a/.github/scripts/generate_binary_build_matrix.py b/.github/scripts/generate_binary_build_matrix.py index 4549a16f7a808..b1e3b46bda344 100644 --- a/.github/scripts/generate_binary_build_matrix.py +++ b/.github/scripts/generate_binary_build_matrix.py @@ -16,7 +16,7 @@ CUDA_ARCHES = ["10.2", "11.3", "11.6", "11.7"] -ROCM_ARCHES = ["5.0", "5.1.1"] +ROCM_ARCHES = ["5.1.1", "5.2"] def arch_type(arch_version: str) -> str: diff --git a/.github/scripts/get_workflow_job_id.py b/.github/scripts/get_workflow_job_id.py index 72aed91d55ca9..e3005a735250f 100644 --- a/.github/scripts/get_workflow_job_id.py +++ b/.github/scripts/get_workflow_job_id.py @@ -31,7 +31,9 @@ args = parser.parse_args() -PYTORCH_REPO = "https://api.github.com/repos/pytorch/pytorch" +# From https://docs.github.com/en/actions/learn-github-actions/environment-variables +PYTORCH_REPO = os.environ.get("GITHUB_REPOSITORY", "pytorch/pytorch") +PYTORCH_GITHUB_API = f"https://api.github.com/repos/{PYTORCH_REPO}" GITHUB_TOKEN = os.environ["GITHUB_TOKEN"] REQUEST_HEADERS = { "Accept": "application/vnd.github.v3+json", @@ -39,7 +41,7 @@ } response = requests.get( - f"{PYTORCH_REPO}/actions/runs/{args.workflow_run_id}/jobs?per_page=100", + f"{PYTORCH_GITHUB_API}/actions/runs/{args.workflow_run_id}/jobs?per_page=100", headers=REQUEST_HEADERS, ) diff --git a/.github/scripts/install_nvidia_utils_linux.sh b/.github/scripts/install_nvidia_utils_linux.sh index b854320c9eaa4..b5274fb5805fb 100755 --- a/.github/scripts/install_nvidia_utils_linux.sh +++ b/.github/scripts/install_nvidia_utils_linux.sh @@ -2,8 +2,9 @@ set -eou pipefail + DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID) \ -DRIVER_FN="NVIDIA-Linux-x86_64-510.60.02.run" +DRIVER_FN="NVIDIA-Linux-x86_64-515.57.run" YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo" install_nvidia_docker2_amzn2() { @@ -24,6 +25,7 @@ install_nvidia_driver_amzn2() { # ensure our kernel install is the same as our underlying kernel, # groupinstall "Development Tools" has a habit of mismatching kernel headers sudo yum install -y "kernel-devel-uname-r == $(uname -r)" + sudo modprobe backlight sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN" sudo /bin/bash /tmp/nvidia_driver -s --no-drm || (sudo cat /var/log/nvidia-installer.log && false) sudo rm -fv /tmp/nvidia_driver diff --git a/.github/scripts/lint_test_ownership.py b/.github/scripts/lint_test_ownership.py deleted file mode 100755 index 270019c0f5634..0000000000000 --- a/.github/scripts/lint_test_ownership.py +++ /dev/null @@ -1,88 +0,0 @@ -#!/usr/bin/env python3 -''' -Test ownership was introduced in https://github.com/pytorch/pytorch/issues/66232. - -This lint verifies that every Python test file (file that matches test_*.py or *_test.py in the test folder) -has valid ownership information in a comment header. Valid means: - - The format of the header follows the pattern "# Owner(s): ["list", "of owner", "labels"] - - Each owner label actually exists in PyTorch - - Each owner label starts with "module: " or "oncall: " or is in ACCEPTABLE_OWNER_LABELS - -This file is expected to run in the root directory of pytorch/pytorch. 
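For context on the `get_workflow_job_id.py` hunk above: deriving the API base from `GITHUB_REPOSITORY` (with a `pytorch/pytorch` fallback) lets the same script query whichever repository the workflow is actually running in, including forks. A minimal sketch of the resulting request; the `Authorization` header shape is assumed here since the diff truncates the headers dict:

```python
import os

import requests  # the script already depends on requests

# GITHUB_REPOSITORY is provided by the Actions runner as "owner/repo";
# the fallback keeps ad-hoc local runs pointed at pytorch/pytorch.
PYTORCH_REPO = os.environ.get("GITHUB_REPOSITORY", "pytorch/pytorch")
PYTORCH_GITHUB_API = f"https://api.github.com/repos/{PYTORCH_REPO}"


def list_workflow_jobs(workflow_run_id: int, token: str) -> dict:
    """Return the jobs of a workflow run from whichever repo the workflow runs in."""
    headers = {
        "Accept": "application/vnd.github.v3+json",
        "Authorization": f"token {token}",  # assumed header shape
    }
    response = requests.get(
        f"{PYTORCH_GITHUB_API}/actions/runs/{workflow_run_id}/jobs?per_page=100",
        headers=headers,
    )
    response.raise_for_status()
    return response.json()
```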
-''' -import boto3 # type: ignore[import] -import botocore # type: ignore[import] -import fnmatch -import json -import sys -from pathlib import Path -from typing import List, Any - - -# Team/owner labels usually start with "module: " or "oncall: ", but the following are acceptable exceptions -ACCEPTABLE_OWNER_LABELS = ["NNC", "high priority"] -GLOB_EXCEPTIONS = [ - "**/test/run_test.py" -] - -PYTORCH_ROOT = Path(__file__).resolve().parent.parent.parent -TEST_DIR = PYTORCH_ROOT / "test" -CURRENT_FILE_NAME = Path(__file__).resolve().relative_to(PYTORCH_ROOT) - -S3_RESOURCE_READ_ONLY = boto3.resource("s3", config=botocore.config.Config(signature_version=botocore.UNSIGNED)) - - -def get_all_test_files() -> List[Path]: - test_files = list(TEST_DIR.glob("**/test_*.py")) - test_files.extend(list(TEST_DIR.glob("**/*_test.py"))) - return [f for f in test_files if not any([fnmatch.fnmatch(str(f), g) for g in GLOB_EXCEPTIONS])] - - -def get_pytorch_labels() -> Any: - bucket = S3_RESOURCE_READ_ONLY.Bucket("ossci-metrics") - summaries = bucket.objects.filter(Prefix="pytorch_labels.json") - for summary in summaries: - labels = summary.get()["Body"].read() - return json.loads(labels) - - -# Returns a string denoting the error invalidating the label OR an empty string if nothing is wrong -def validate_label(label: str, pytorch_labels: List[str]) -> str: - if label not in pytorch_labels: - return f"{label} is not a PyTorch label (please choose from https://github.com/pytorch/pytorch/labels)" - if label.startswith("module:") or label.startswith("oncall:") or label in ACCEPTABLE_OWNER_LABELS: - return "" - return f"{label} is not an acceptable owner (please update to another label or edit ACCEPTABLE_OWNERS_LABELS " \ - "in {CURRENT_FILE_NAME}" - - -# Returns a string denoting the error invalidating the file OR an empty string if nothing is wrong -def validate_file(filename: Path, pytorch_labels: List[str]) -> str: - prefix = "# Owner(s): " - relative_name = Path(filename).relative_to(PYTORCH_ROOT) - with open(filename) as f: - for line in f.readlines(): - if line.startswith(prefix): - labels = json.loads(line[len(prefix):]) - labels_msgs = [validate_label(label, pytorch_labels) for label in labels] - file_msg = ", ".join([x for x in labels_msgs if x != ""]) - return f"{relative_name}: {file_msg}" if file_msg != "" else "" - return f"{relative_name}: missing a comment header with ownership information." - - -def main() -> None: - test_file_paths = get_all_test_files() - pytorch_labels = get_pytorch_labels() - - file_msgs = [validate_file(f, pytorch_labels) for f in test_file_paths] - err_msg = "\n".join([x for x in file_msgs if x != ""]) - if err_msg != "": - err_msg = err_msg + "\n\nIf you see files with missing ownership information above, " \ - "please add the following line\n\n# Owner(s): [\"\"]\n\nto the top of each test file. " \ - "The owner should be an existing pytorch/pytorch label." 
- print(err_msg) - sys.exit(1) - - -if __name__ == '__main__': - main() diff --git a/.github/scripts/test_trymerge.py b/.github/scripts/test_trymerge.py index af3faf8cd0948..572863098da72 100755 --- a/.github/scripts/test_trymerge.py +++ b/.github/scripts/test_trymerge.py @@ -18,9 +18,12 @@ gh_get_team_members, read_merge_rules, validate_revert, + filter_pending_checks, + filter_failed_checks, GitHubPR, MergeRule, MandatoryChecksMissingError, + WorkflowCheckState, main as trymerge_main) from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo from typing import Any, List, Optional @@ -139,7 +142,7 @@ def commit_message(self, ref: str) -> str: class TestGitHubPR(TestCase): def test_merge_rules_valid(self) -> None: - "Test that merge_rules.json can be parsed" + "Test that merge_rules.yaml can be parsed" repo = DummyGitRepo() self.assertGreater(len(read_merge_rules(repo, "pytorch", "pytorch")), 1) @@ -337,5 +340,21 @@ def test_revert_rules(self, mock_gql: Any) -> None: repo = DummyGitRepo() self.assertIsNotNone(validate_revert(repo, pr, comment_id=1189459845)) + def test_checks_filter(self) -> None: + checks = [ + WorkflowCheckState(name="check0", status="SUCCESS", url="url0"), + WorkflowCheckState(name="check1", status="FAILURE", url="url1"), + WorkflowCheckState(name="check2", status="STARTUP_FAILURE", url="url2"), + WorkflowCheckState(name="check3", status=None, url="url3"), + ] + + checks_dict = {check.name : check for check in checks} + + pending_checks = filter_pending_checks(checks_dict) + failing_checks = filter_failed_checks(checks_dict) + + self.assertListEqual(failing_checks, [checks[1], checks[2]]) + self.assertListEqual(pending_checks, [checks[3]]) + if __name__ == "__main__": main() diff --git a/.github/scripts/trymerge.py b/.github/scripts/trymerge.py index 9e23869cb3804..e64d9c9ea16df 100755 --- a/.github/scripts/trymerge.py +++ b/.github/scripts/trymerge.py @@ -6,15 +6,43 @@ import re import time import urllib.parse -from datetime import datetime from dataclasses import dataclass -from urllib.request import urlopen, Request -from urllib.error import HTTPError -from typing import Iterable, Pattern, cast, Any, Callable, Dict, List, Optional, Tuple, Union -from gitutils import get_git_remote_name, get_git_repo_dir, patterns_to_regex, GitRepo +from datetime import datetime from functools import lru_cache +import yaml +from typing import ( + Any, + Callable, + Dict, + Iterable, + List, + Optional, + Pattern, + Tuple, + Union, + cast, + NamedTuple +) +from urllib.error import HTTPError +from urllib.request import Request, urlopen from warnings import warn +from gitutils import ( + GitRepo, + get_git_remote_name, + get_git_repo_dir, + patterns_to_regex, +) +from trymerge_explainer import ( + TryMergeExplainer, + get_land_check_troubleshooting_message, + get_revert_message, +) + +class WorkflowCheckState(NamedTuple): + status: Optional[str] + url: str + name: str GH_PR_REVIEWS_FRAGMENT = """ fragment PRReviews on PullRequestReviewConnection { @@ -373,7 +401,6 @@ RE_DIFF_REV = re.compile(r'^Differential Revision:.+?(D[0-9]+)', re.MULTILINE) CIFLOW_LABEL = re.compile(r"^ciflow/.+") CIFLOW_TRUNK_LABEL = re.compile(r"^ciflow/trunk") -BOT_COMMANDS_WIKI = 'https://github.com/pytorch/pytorch/wiki/Bot-commands' def _fetch_url(url: str, *, headers: Optional[Dict[str, str]] = None, @@ -465,12 +492,11 @@ def get_check_run_name_prefix(workflow_run: Any) -> str: else: return f'{workflow_run["workflow"]["name"]} / ' - def add_workflow_conclusions( checksuites: Any, 
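The new `test_checks_filter` case above exercises the `WorkflowCheckState` named tuple together with the `filter_pending_checks` / `filter_failed_checks` helpers that this patch adds further down in `trymerge.py`. A self-contained sketch of that behavior, condensed from the test and the helpers (not the exact module code):

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class WorkflowCheckState(NamedTuple):
    # Same shape as the named tuple added to trymerge.py.
    status: Optional[str]  # None means the check has not reported a conclusion yet
    url: str
    name: str


def filter_checks_with_lambda(
    checks: Dict[str, WorkflowCheckState],
    status_filter: Callable[[Optional[str]], bool],
) -> List[WorkflowCheckState]:
    return [check for check in checks.values() if status_filter(check.status)]


def filter_pending_checks(checks: Dict[str, WorkflowCheckState]) -> List[WorkflowCheckState]:
    return filter_checks_with_lambda(checks, lambda status: status is None)


def filter_failed_checks(checks: Dict[str, WorkflowCheckState]) -> List[WorkflowCheckState]:
    return filter_checks_with_lambda(checks, lambda status: status in ["FAILURE", "STARTUP_FAILURE"])


checks = {
    check.name: check
    for check in [
        WorkflowCheckState(name="check0", status="SUCCESS", url="url0"),
        WorkflowCheckState(name="check1", status="FAILURE", url="url1"),
        WorkflowCheckState(name="check2", status="STARTUP_FAILURE", url="url2"),
        WorkflowCheckState(name="check3", status=None, url="url3"),
    ]
}
assert [c.name for c in filter_failed_checks(checks)] == ["check1", "check2"]
assert [c.name for c in filter_pending_checks(checks)] == ["check3"]
```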
get_next_checkruns_page: Callable[[List[Dict[str, Dict[str, Any]]], int, Any], Any], get_next_checksuites: Callable[[Any], Any] -) -> Dict[str, Tuple[str, str]]: +) -> Dict[str, WorkflowCheckState]: conclusions = {} def add_conclusions(edges: Any) -> None: @@ -484,14 +510,23 @@ def add_conclusions(edges: Any) -> None: # Do not override existing status with cancelled if workflow_conclusion == "CANCELLED" and workflow_name in conclusions: continue - conclusions[workflow_name] = (workflow_conclusion, node["url"]) + conclusions[workflow_name] = WorkflowCheckState( + name=workflow_name, + status=workflow_conclusion, + url=node["url"]) has_failing_check = False while checkruns is not None: for checkrun_node in checkruns["nodes"]: + if not isinstance(checkrun_node, dict): + warn(f"Expected dictionary, but got {type(checkrun_node)}") + continue if checkrun_node["conclusion"] == 'FAILURE': has_failing_check = True - conclusions[f'{get_check_run_name_prefix(workflow_run)}{checkrun_node["name"]}'] = ( - checkrun_node["conclusion"], checkrun_node["detailsUrl"] + checkrun_name = f'{get_check_run_name_prefix(workflow_run)}{checkrun_node["name"]}' + conclusions[checkrun_name] = WorkflowCheckState( + name=checkrun_name, + status=checkrun_node["conclusion"], + url=checkrun_node["detailsUrl"] ) if bool(checkruns["pageInfo"]["hasNextPage"]): checkruns = get_next_checkruns_page(edges, edge_idx, checkruns) @@ -499,7 +534,11 @@ def add_conclusions(edges: Any) -> None: checkruns = None # Github doesn't set conclusion to failure if a job is still pending if workflow_run is not None and has_failing_check: - conclusions[workflow_run["workflow"]["name"]] = ("FAILURE", node["url"]) + workflow_name = workflow_run["workflow"]["name"] + conclusions[workflow_name] = WorkflowCheckState( + name=workflow_name, + status="FAILURE", + url=node["url"]) add_conclusions(checksuites["edges"]) while bool(checksuites["pageInfo"]["hasNextPage"]): @@ -550,7 +589,7 @@ def __init__(self, org: str, project: str, pr_num: int) -> None: self.info = gh_get_pr_info(org, project, pr_num) self.changed_files: Optional[List[str]] = None self.labels: Optional[List[str]] = None - self.conclusions: Optional[Dict[str, Tuple[str, str]]] = None + self.conclusions: Optional[Dict[str, WorkflowCheckState]] = None self.comments: Optional[List[GitHubComment]] = None self._authors: Optional[List[Tuple[str, str]]] = None self._reviews: Optional[List[Tuple[str, str]]] = None @@ -678,7 +717,7 @@ def get_labels(self) -> List[str]: self.labels = labels return self.labels - def get_checkrun_conclusions(self) -> Dict[str, Tuple[str, str]]: + def get_checkrun_conclusions(self) -> Dict[str, WorkflowCheckState]: """ Returns dict of checkrun -> [conclusion, url] """ if self.conclusions is not None: return self.conclusions @@ -803,9 +842,15 @@ def has_internal_changes(self) -> bool: checks = self.get_checkrun_conclusions() if checks is None or checkrun_name not in checks: return False - return checks[checkrun_name][0] != "SUCCESS" - - def merge_ghstack_into(self, repo: GitRepo, force: bool, comment_id: Optional[int] = None) -> None: + return checks[checkrun_name].status != "SUCCESS" + + def merge_ghstack_into( + self, + repo: GitRepo, + force: bool, + comment_id: Optional[int] = None, + land_check_commit: Optional[str] = None + ) -> None: assert self.is_ghstack_pr() # For ghstack, cherry-pick commits based from origin orig_ref = f"{repo.remote}/{re.sub(r'/head$', '/orig', self.head_ref())}" @@ -826,7 +871,12 @@ def merge_ghstack_into(self, repo: GitRepo, force: bool, 
comment_id: Optional[in continue commit_msg = pr.gen_commit_message(filter_ghstack=True) # Raises exception if matching rule is not found - find_matching_merge_rule(pr, repo, force=force, skip_internal_checks=can_skip_internal_checks(self, comment_id)) + find_matching_merge_rule( + pr, + repo, + force=force, + skip_internal_checks=can_skip_internal_checks(self, comment_id), + land_check_commit=land_check_commit) repo.cherry_pick(rev) repo.amend_commit_message(commit_msg) @@ -846,10 +896,16 @@ def gen_commit_message(self, filter_ghstack: bool = False) -> str: def merge_into(self, repo: GitRepo, *, force: bool = False, dry_run: bool = False, - comment_id: Optional[int] = None) -> None: + comment_id: Optional[int] = None, + land_check_commit: Optional[str] = None) -> None: # Raises exception if matching rule is not found - find_matching_merge_rule(self, repo, force=force, skip_internal_checks=can_skip_internal_checks(self, comment_id)) - self.merge_changes(repo, force, comment_id) + find_matching_merge_rule( + self, + repo, + force=force, + skip_internal_checks=can_skip_internal_checks(self, comment_id), + land_check_commit=land_check_commit) + self.merge_changes(repo, force, comment_id, land_check_commit=land_check_commit) repo.push(self.default_branch(), dry_run) if not dry_run: @@ -859,6 +915,7 @@ def merge_changes(self, repo: GitRepo, force: bool = False, comment_id: Optional[int] = None, + land_check_commit: Optional[str] = None, branch: Optional[str] = None) -> None: branch_to_merge_into = self.default_branch() if branch is None else branch if repo.current_branch() != branch_to_merge_into: @@ -870,13 +927,14 @@ def merge_changes(self, repo._run_git("merge", "--squash", pr_branch_name) repo._run_git("commit", f"--author=\"{self.get_author()}\"", "-m", msg) else: - self.merge_ghstack_into(repo, force, comment_id=comment_id) + self.merge_ghstack_into(repo, force, comment_id=comment_id, land_check_commit=land_check_commit) def create_land_time_check_branch(self, repo: GitRepo, branch: str, force: bool = False, comment_id: Optional[int] = None,) -> str: + orig_branch = repo.current_branch() self.merge_changes(repo, branch=branch, force=force, comment_id=comment_id) land_check_branch = f'landchecks/{self.pr_num}' try: @@ -886,11 +944,9 @@ def create_land_time_check_branch(self, repo._run_git('checkout', "-b", land_check_branch) repo._run_git('push', '-u', 'origin', land_check_branch, '--force') commit = repo.get_commit('HEAD').commit_hash - gh_post_pr_comment(self.org, self.project, self.pr_num, - '@pytorchbot successfully started a merge and created land time checks.' 
+ - f' See merge status [here]({os.getenv("GH_RUN_URL")}) ' + - f'and [land check]({BOT_COMMANDS_WIKI}) ' - f'progress [here](https://hud.pytorch.org/{self.org}/{self.project}/commit/{commit}).') + # Important, return to original branch + if repo.current_branch() != orig_branch: + repo.checkout(orig_branch) return commit @@ -912,7 +968,7 @@ class MergeRule: def read_merge_rules(repo: Optional[GitRepo], org: str, project: str) -> List[MergeRule]: from pathlib import Path - repo_relative_rules_path = Path(".github") / "merge_rules.json" + repo_relative_rules_path = Path(".github") / "merge_rules.yaml" if repo is None: json_data = _fetch_url( f"https://api.github.com/repos/{org}/{project}/contents/{repo_relative_rules_path}", @@ -920,21 +976,22 @@ def read_merge_rules(repo: Optional[GitRepo], org: str, project: str) -> List[Me reader=json.load, ) content = base64.b64decode(json_data["content"]) - return cast(List[MergeRule], json.loads(content, object_hook=lambda x: MergeRule(**x))) + return [MergeRule(**x) for x in yaml.safe_load(content)] else: rules_path = Path(repo.repo_dir) / repo_relative_rules_path if not rules_path.exists(): print(f"{rules_path} does not exist, returning empty rules") return [] with open(rules_path) as fp: - rc = json.load(fp, object_hook=lambda x: MergeRule(**x)) - return cast(List[MergeRule], rc) + rc = yaml.safe_load(fp) + return [MergeRule(**x) for x in rc] def find_matching_merge_rule(pr: GitHubPR, repo: Optional[GitRepo] = None, force: bool = False, - skip_internal_checks: bool = False + skip_internal_checks: bool = False, + land_check_commit: Optional[str] = None, ) -> MergeRule: """Returns merge rule matching to this pr or raises an exception""" changed_files = pr.get_changed_files() @@ -984,21 +1041,27 @@ def find_matching_merge_rule(pr: GitHubPR, f"{', '.join(list(rule_approvers_set)[:5])}{', ...' if len(rule_approvers_set) > 5 else ''}") continue mandatory_checks = rule.mandatory_checks_name if rule.mandatory_checks_name is not None else [] - checks = pr.get_checkrun_conclusions() + checks = get_combined_checks_from_pr_and_land_validation(pr, land_check_commit) required_checks = filter(lambda x: force is False or "CLA Check" in x, mandatory_checks) [pending_checks, failed_checks] = categorize_checks(checks, required_checks) if len(failed_checks) > 0: if reject_reason_score < 30000: reject_reason_score = 30000 - reject_reason = ("Refusing to merge as mandatory check(s) " + - checks_to_str(failed_checks) + f" failed for rule {rule_name}") + reject_reason = ( + f"[View failures on hud](https://hud.pytorch.org/{pr.org}/{pr.project}/commit/{pr.last_commit()['oid']}). " + + f"Refusing to merge as mandatory check(s) {checks_to_str(failed_checks)} failed for " + + f"rule {rule_name}." + ) continue elif len(pending_checks) > 0: if reject_reason_score < 20000: reject_reason_score = 20000 - reject_reason = f"Refusing to merge as mandatory check(s) {checks_to_str(pending_checks)}" - reject_reason += f" are pending/not yet run for rule {rule_name}" + reject_reason = ( + f"[View pending jobs on hud](https://hud.pytorch.org/{pr.org}/{pr.project}/commit/{pr.last_commit()['oid']}). " + + f"Refusing to merge as mandatory check(s) {checks_to_str(pending_checks)} are pending/not yet run for " + + f"rule {rule_name}." 
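With `read_merge_rules` now reading `merge_rules.yaml`, each YAML entry is unpacked directly into the `MergeRule` dataclass via `yaml.safe_load`. A minimal sketch of that round trip; the field names are inferred from the YAML keys added earlier in this patch, and the real dataclass is the one defined in `trymerge.py`:

```python
from dataclasses import dataclass
from typing import List, Optional

import yaml  # trymerge.py now imports yaml for exactly this parsing


@dataclass
class MergeRule:
    # Field names inferred from the keys used in .github/merge_rules.yaml.
    name: str
    patterns: List[str]
    approved_by: List[str]
    mandatory_checks_name: Optional[List[str]] = None


rules_yaml = """
- name: Lazy Tensor
  patterns:
    - torch/csrc/lazy/**
    - test/lazy/**
  approved_by:
    - alanwaketan
    - JackCaoG
  mandatory_checks_name:
    - Facebook CLA Check
    - Lint
    - pull
"""

# yaml.safe_load yields a list of dicts; each dict maps one-to-one onto MergeRule.
rules = [MergeRule(**entry) for entry in yaml.safe_load(rules_yaml)]
assert rules[0].name == "Lazy Tensor"
assert "pull" in (rules[0].mandatory_checks_name or [])
```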
+ ) continue if not skip_internal_checks and pr.has_internal_changes(): raise RuntimeError("This PR has internal changes and must be landed via Phabricator") @@ -1008,7 +1071,7 @@ def find_matching_merge_rule(pr: GitHubPR, raise RuntimeError(reject_reason) -def get_land_checkrun_conclusions(org: str, project: str, commit: str) -> Dict[str, Tuple[str, str]]: +def get_land_checkrun_conclusions(org: str, project: str, commit: str) -> Dict[str, WorkflowCheckState]: def get_commit_next_check_runs(edges: List[Dict[str, Dict[str, Any]]], edge_idx: int, checkruns: Any) -> Any: rc = gh_graphql(GH_GET_COMMIT_NEXT_CHECK_RUNS, @@ -1037,18 +1100,44 @@ def get_commit_next_checksuites(checksuites: Any) -> Any: def checks_to_str(checks: List[Tuple[str, Optional[str]]]) -> str: return ", ".join(f"[{c[0]}]({c[1]})" if c[1] is not None else c[0] for c in checks) -def pr_get_checks_with_lambda(pr: GitHubPR, status_check: Callable[[Optional[str]], bool]) -> List[Tuple[str, str]]: - checks = pr.get_checkrun_conclusions() - return [(name, status[1]) for name, status in checks.items() if status_check(status[0])] +def get_combined_checks_from_pr_and_land_validation( + pr: GitHubPR, + land_check_commit: Optional[str] +) -> Dict[str, WorkflowCheckState]: + """ + Combines checks from both the PR and land validation to get a holistic view + of all checks. + + This helps us cover the corner case where certain workflows may have been + requested on the PR but are not part of land validation (e.g. nightly + builds) or are implicitly run on PRs but not on land validation branches + (like CLA Checks). + + At the same time, we prioritize the signal workflows which do run on land + validation. + E.g. if a workflow fails on the PR but passes on land validation then we'd + use the successful result from the land validation. + """ -def pr_get_pending_checks(pr: GitHubPR) -> List[Tuple[str, str]]: - return pr_get_checks_with_lambda(pr, lambda x: x is None) + pr_checks = pr.get_checkrun_conclusions() + land_validation_checks = get_land_checkrun_conclusions(pr.org, pr.project, land_check_commit) if land_check_commit else {} + # Merge the two checks together. 
Land validation check results (if any) overwrite pr check results + merged_checks = {**pr_checks, **land_validation_checks} # explanation: https://stackoverflow.com/a/26853961/21539 + return merged_checks -def pr_get_failed_checks(pr: GitHubPR) -> List[Tuple[str, str]]: - return pr_get_checks_with_lambda(pr, lambda x: x in ["FAILURE", "STARTUP_FAILURE"]) +def filter_checks_with_lambda( + checks: Dict[str, WorkflowCheckState], + status_filter: Callable[[Optional[str]], bool] +) -> List[WorkflowCheckState]: + return [check for check in checks.values() if status_filter(check.status)] +def filter_pending_checks(checks: Dict[str, WorkflowCheckState]) -> List[WorkflowCheckState]: + return filter_checks_with_lambda(checks, lambda x: x is None) + +def filter_failed_checks(checks: Dict[str, WorkflowCheckState]) -> List[WorkflowCheckState]: + return filter_checks_with_lambda(checks, lambda x: x in ["FAILURE", "STARTUP_FAILURE"]) def validate_revert(repo: GitRepo, pr: GitHubPR, *, comment_id: Optional[int] = None) -> Tuple[str, str]: @@ -1133,7 +1222,7 @@ def check_for_sev(org: str, project: str, force: bool) -> None: def validate_land_time_checks(org: str, project: str, commit: str) -> None: checks = get_land_checkrun_conclusions(org, project, commit) - if(len(checks) == 0): + if len(checks) == 0: raise MandatoryChecksMissingError("Refusing to merge as land check(s) are not yet run") [pending_checks, failed_checks] = categorize_checks(checks, checks) @@ -1146,19 +1235,17 @@ def validate_land_time_checks(org: str, project: str, commit: str) -> None: def has_label(labels: List[str], pattern: Pattern[str] = CIFLOW_LABEL) -> bool: return len(list(filter(pattern.match, labels))) > 0 -def categorize_checks(check_runs: Dict[str, Tuple[str, str]], +def categorize_checks(check_runs: Dict[str, WorkflowCheckState], required_checks: Iterable[str]) -> Tuple[List[Tuple[str, Optional[str]]], List[Tuple[str, Optional[str]]]]: pending_checks: List[Tuple[str, Optional[str]]] = [] failed_checks: List[Tuple[str, Optional[str]]] = [] for checkname in required_checks: if checkname not in check_runs: pending_checks.append((checkname, None)) - elif check_runs[checkname][0] is None: - pending_checks.append((checkname, check_runs[checkname][1])) - elif (check_runs[checkname][0].upper() != 'SUCCESS' - and check_runs[checkname][0].upper() != 'SKIPPED' - and check_runs[checkname][0].upper() != 'NEUTRAL'): - failed_checks.append((checkname, check_runs[checkname][1])) + elif check_runs[checkname].status is None: + pending_checks.append((checkname, check_runs[checkname].url)) + elif (str(check_runs[checkname].status).upper() not in ['SUCCESS', 'SKIPPED', 'NEUTRAL']): + failed_checks.append((checkname, check_runs[checkname].url)) return (pending_checks, failed_checks) def merge(pr_num: int, repo: GitRepo, @@ -1174,16 +1261,24 @@ def merge(pr_num: int, repo: GitRepo, org, project = repo.gh_owner_and_name() pr = GitHubPR(org, project, pr_num) initial_commit_sha = pr.last_commit()['oid'] + explainer = TryMergeExplainer(force, on_green, land_checks, pr.get_labels(), pr.pr_num, org, project) + on_green, land_checks = explainer.get_flags() + land_check_commit = None + check_for_sev(org, project, force) + if force or can_skip_internal_checks(pr, comment_id): # do not wait for any pending signals if PR is closed as part of co-development process + gh_post_pr_comment(org, project, pr.pr_num, explainer.get_merge_message()) return pr.merge_into(repo, dry_run=dry_run, force=force, comment_id=comment_id) - if (datetime.utcnow() - 
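`get_combined_checks_from_pr_and_land_validation` combines the two result sets with `{**pr_checks, **land_validation_checks}`, relying on the fact that keys from the second dict win. Reduced to plain status strings for illustration:

```python
# Later keys win in a dict-unpacking merge, so a check that failed on the PR but
# passed on the land-validation branch is reported with the land-validation result,
# while PR-only workflows (e.g. CLA checks) are still carried over.
pr_checks = {"pull / linux-test": "FAILURE", "Facebook CLA Check": "SUCCESS"}
land_validation_checks = {"pull / linux-test": "SUCCESS"}

merged_checks = {**pr_checks, **land_validation_checks}
assert merged_checks["pull / linux-test"] == "SUCCESS"
assert merged_checks["Facebook CLA Check"] == "SUCCESS"
```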
pr.last_pushed_at()).days > stale_pr_days: - raise RuntimeError("This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.") if land_checks: land_check_commit = pr.create_land_time_check_branch(repo, 'viable/strict', force=force, comment_id=comment_id) + gh_post_pr_comment(org, project, pr.pr_num, explainer.get_merge_message(land_check_commit)) + if (datetime.utcnow() - pr.last_pushed_at()).days > stale_pr_days: + raise RuntimeError("This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.") + start_time = time.time() last_exception = '' elapsed_time = 0.0 @@ -1197,26 +1292,27 @@ def merge(pr_num: int, repo: GitRepo, raise RuntimeError("New commits were pushed while merging. Please rerun the merge command.") try: find_matching_merge_rule(pr, repo) - pending = pr_get_pending_checks(pr) - failing = pr_get_failed_checks(pr) + checks = get_combined_checks_from_pr_and_land_validation(pr, land_check_commit) + pending = filter_pending_checks(checks) + failing = filter_failed_checks(checks) # HACK until GitHub will be better about surfacing those - startup_failures = pr_get_checks_with_lambda(pr, lambda x: x == "STARTUP_FAILURE") + startup_failures = filter_checks_with_lambda(checks, lambda status: status == "STARTUP_FAILURE") if len(startup_failures) > 0: raise RuntimeError(f"{len(failing)} STARTUP failures reported, please check workflows syntax! " + - ' ,'.join(f"[{x[0]}]({x[1]})" for x in startup_failures[:5])) + ' ,'.join(f"[{x.name}]({x.url})" for x in startup_failures[:5])) # END of HACK if (not mandatory_only and on_green) and len(failing) > 0: raise RuntimeError(f"{len(failing)} additional jobs have failed, first few of them are: " + - ' ,'.join(f"[{x[0]}]({x[1]})" for x in failing[:5])) + ' ,'.join(f"[{x.name}]({x.url})" for x in failing[:5])) if (not mandatory_only and on_green) and len(pending) > 0: raise MandatoryChecksMissingError(f"Still waiting for {len(pending)} additional jobs to finish, " + - f"first few of them are: {' ,'.join(x[0] for x in pending[:5])}") - if land_checks: + f"first few of them are: {' ,'.join(x.name for x in pending[:5])}") + if land_checks and land_check_commit is not None: validate_land_time_checks(org, project, land_check_commit) - return pr.merge_into(repo, dry_run=dry_run, force=force, comment_id=comment_id) + return pr.merge_into(repo, dry_run=dry_run, force=force, comment_id=comment_id, land_check_commit=land_check_commit) except MandatoryChecksMissingError as ex: last_exception = str(ex) print(f"Merge of https://github.com/{org}/{project}/pull/{pr_num} failed due to: {ex}. Retrying in 5 min") @@ -1233,28 +1329,21 @@ def main() -> None: repo = GitRepo(get_git_repo_dir(), get_git_remote_name()) org, project = repo.gh_owner_and_name() pr = GitHubPR(org, project, args.pr_num) - land_checks = args.land_checks and not has_label(pr.get_labels(), CIFLOW_TRUNK_LABEL) def handle_exception(e: Exception, msg: str = "Merge failed") -> None: - msg += f" due to {e}" + msg += f"\nReason: {e}" run_url = os.getenv("GH_RUN_URL") if run_url is not None: - msg += f"\nRaised by {run_url}" - if land_checks: - msg += (" If you believe this is an error, you can use the old behavior with `@pytorchbot merge -g`" + - ' (optionally with the "ciflow/trunk" to get land signals)' + - ' or use `@pytorchbot merge -f "some reason here"`.' 
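The reworked `categorize_checks` above sorts required checks into pending and failed buckets based on `WorkflowCheckState.status`. A condensed restatement with illustrative data (the named tuple here mirrors the one added to `trymerge.py`):

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class WorkflowCheckState(NamedTuple):  # mirrors the trymerge.py named tuple
    status: Optional[str]
    url: str
    name: str


def categorize(
    check_runs: Dict[str, WorkflowCheckState], required_checks: List[str]
) -> Tuple[List[Tuple[str, Optional[str]]], List[Tuple[str, Optional[str]]]]:
    pending: List[Tuple[str, Optional[str]]] = []
    failed: List[Tuple[str, Optional[str]]] = []
    for name in required_checks:
        if name not in check_runs:
            pending.append((name, None))  # required but never reported
        elif check_runs[name].status is None:
            pending.append((name, check_runs[name].url))  # reported, still running
        elif str(check_runs[name].status).upper() not in ["SUCCESS", "SKIPPED", "NEUTRAL"]:
            failed.append((name, check_runs[name].url))  # anything else counts as failed
    return pending, failed


check_runs = {
    "Lint": WorkflowCheckState(status="SUCCESS", url="u1", name="Lint"),
    "pull": WorkflowCheckState(status=None, url="u2", name="pull"),
    "Facebook CLA Check": WorkflowCheckState(status="FAILURE", url="u3", name="Facebook CLA Check"),
}
pending, failed = categorize(check_runs, ["Facebook CLA Check", "Lint", "pull", "trunk"])
assert pending == [("pull", "u2"), ("trunk", None)]
assert failed == [("Facebook CLA Check", "u3")]
```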
+ - f" For more information, see the [bot wiki]({BOT_COMMANDS_WIKI}).") + msg += f"\nRaised by [workflow job]({run_url})" + if args.land_checks: + msg += get_land_check_troubleshooting_message() gh_post_pr_comment(org, project, args.pr_num, msg, dry_run=args.dry_run) import traceback traceback.print_exc() - if not land_checks: - msg = f"@pytorchbot successfully started a {'revert' if args.revert else 'merge'} job." - msg += f" Check the current status [here]({os.getenv('GH_RUN_URL')})" - gh_post_pr_comment(org, project, args.pr_num, msg, dry_run=args.dry_run) if args.revert: try: + gh_post_pr_comment(org, project, args.pr_num, get_revert_message(org, project, pr.pr_num), args.dry_run) try_revert(repo, pr, dry_run=args.dry_run, comment_id=args.comment_id, reason=args.reason) except Exception as e: handle_exception(e, f"Reverting PR {args.pr_num} failed") @@ -1269,14 +1358,13 @@ def handle_exception(e: Exception, msg: str = "Merge failed") -> None: return try: - on_green = args.on_green or has_label(pr.get_labels(), CIFLOW_LABEL) merge(args.pr_num, repo, dry_run=args.dry_run, force=args.force, comment_id=args.comment_id, - on_green=on_green, + on_green=args.on_green, mandatory_only=args.on_mandatory, - land_checks=land_checks) + land_checks=args.land_checks) except Exception as e: handle_exception(e) diff --git a/.github/scripts/trymerge_explainer.py b/.github/scripts/trymerge_explainer.py new file mode 100644 index 0000000000000..e59307f10854c --- /dev/null +++ b/.github/scripts/trymerge_explainer.py @@ -0,0 +1,146 @@ +import os +import re +from typing import List, Pattern, Tuple, Optional + + +BOT_COMMANDS_WIKI = "https://github.com/pytorch/pytorch/wiki/Bot-commands" + +CIFLOW_LABEL = re.compile(r"^ciflow/.+") +CIFLOW_TRUNK_LABEL = re.compile(r"^ciflow/trunk") + +OFFICE_HOURS_LINK = "https://github.com/pytorch/pytorch/wiki/Dev-Infra-Office-Hours" +CONTACT_US = f"Please reach out to the [PyTorch DevX Team]({OFFICE_HOURS_LINK}) with feedback or questions!" +ALTERNATIVES = ( + "If this is not the intended behavior, feel free to use some " + + f"of the other merge options in the [wiki]({BOT_COMMANDS_WIKI})." +) +LAND_CHECK_ROLLOUT = "https://github.com/pytorch/test-infra/blob/main/torchci/lib/bot/rolloutUtils.ts#L1-L34" + + +def has_label(labels: List[str], pattern: Pattern[str] = CIFLOW_LABEL) -> bool: + return len(list(filter(pattern.match, labels))) > 0 + + +class TryMergeExplainer(object): + force: bool + on_green: bool + land_checks: bool + labels: List[str] + pr_num: int + org: str + project: str + + has_trunk_label: bool + has_ciflow_label: bool + + def __init__( + self, + force: bool, + on_green: bool, + land_checks: bool, + labels: List[str], + pr_num: int, + org: str, + project: str, + ): + self.force = force + self.on_green = on_green + self.land_checks = land_checks + self.labels = labels + self.pr_num = pr_num + self.org = org + self.project = project + self.get_flags() + + def get_flags(self) -> Tuple[bool, bool]: + self.has_trunk_label = has_label(self.labels, CIFLOW_TRUNK_LABEL) + self.has_ciflow_label = has_label(self.labels, CIFLOW_LABEL) + should_check_land_branch = self.land_checks and not self.has_trunk_label + should_check_green = self.on_green or self.has_ciflow_label + + return (should_check_green, should_check_land_branch) + + def _get_flag_msg(self) -> str: + if self.force: + return " the force (-f) flag." + elif self.on_green: + return " the green (-g) flag." + elif self.land_checks: + return ( + " the land checks (-l) flag." 
+ + " If you did not specify this flag yourself, " + + f" you are likely enrolled in the [land checks rollout]({LAND_CHECK_ROLLOUT})." + ) + else: + return "out a flag." + + def _get_land_check_progress(self, commit: Optional[str]) -> str: + if commit is not None: + return ( + " and land check " + + f"progress [here](https://hud.pytorch.org/{self.org}/{self.project}/commit/{commit})" + ) + else: + return "" + + def _get_flag_explanation_message(self) -> str: + if self.force: + return "This means your change will be merged **immediately**, bypassing any CI checks (ETA: 1-5 minutes)." + elif self.on_green: + return "This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours)." + elif self.land_checks: + if self.has_trunk_label: + land_check_msg_suffix = "have passed since you have added the `ciflow/trunk` label to your PR (ETA 0-4 Hours)." + else: + land_check_msg_suffix = ( + "and the land checks have passed (**ETA 4 Hours**). " + ) + land_check_msg_suffix += "If you need to coordinate lands between different changes and cannot risk a land race, " + land_check_msg_suffix += "please add the `ciflow/trunk` label to your PR and wait for signal to complete, " + land_check_msg_suffix += "and then land your changes in proper order." + land_check_msg_suffix += ( + " Having `trunk`, `pull`, and `Lint` pre-run on a " + ) + land_check_msg_suffix += ( + "PR will bypass land checks and the ETA should be immediate." + ) + + return ( + "This means that your change will be merged once all checks on your PR " + + land_check_msg_suffix + ) + else: + return "This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours)." + + def get_merge_message(self, commit: Optional[str] = None) -> str: + message_prefix = "@pytorchbot successfully started a merge job." + progress_links = f"Check the current status [here]({os.getenv('GH_RUN_URL')}){self._get_land_check_progress(commit)}." + flag_message = f"The merge job was triggered with{self._get_flag_msg()}" + explanation_message = self._get_flag_explanation_message() + + msg = message_prefix + " " + msg += progress_links + "\n" + msg += flag_message + " " + msg += explanation_message + " " + msg += ALTERNATIVES + "\n" + msg += CONTACT_US + return msg + + +def get_revert_message(org: str, project: str, pr_num: int) -> str: + msg = ( + "@pytorchbot successfully started a revert job." + + f" Check the current status [here]({os.getenv('GH_RUN_URL')}).\n" + ) + msg += CONTACT_US + return msg + + +def get_land_check_troubleshooting_message() -> str: + return ( + " If you believe this is an error, you can use the old behavior with `@pytorchbot merge -g`" + + " (optionally with the `ciflow/trunk` to get land checks)" + + ' or use `@pytorchbot merge -f "some reason here"`.' + + f" For more information, see the [bot wiki]({BOT_COMMANDS_WIKI}). 
\n" + + CONTACT_US + ) diff --git a/.github/scripts/update_commit_hashes.py b/.github/scripts/update_commit_hashes.py index 5dad5877ca4ae..4b638cf11c90c 100644 --- a/.github/scripts/update_commit_hashes.py +++ b/.github/scripts/update_commit_hashes.py @@ -136,6 +136,7 @@ def main() -> None: ) with open(f".github/ci_commit_pins/{args.repo_name}.txt", "r+") as f: old_hash = f.read().strip() + subprocess.run(f"git checkout {old_hash}".split(), cwd=args.repo_name) f.seek(0) f.truncate() f.write(f"{hash}\n") diff --git a/.github/templates/common.yml.j2 b/.github/templates/common.yml.j2 index f0f3e3a430f7d..b80b82f5d610d 100644 --- a/.github/templates/common.yml.j2 +++ b/.github/templates/common.yml.j2 @@ -1,5 +1,7 @@ {%- set upload_artifact_s3_action = "seemethere/upload-artifact-s3@v5" -%} {%- set download_artifact_s3_action = "seemethere/download-artifact-s3@v4" -%} +{%- set upload_artifact_action = "actions/upload-artifact@v3" -%} +{%- set download_artifact_action = "actions/download-artifact@v3" -%} {# squid_proxy is an private ELB that only available for GHA custom runners #} {%- set squid_proxy = "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -%} diff --git a/.github/templates/linux_binary_build_workflow.yml.j2 b/.github/templates/linux_binary_build_workflow.yml.j2 index 2879da9dad9c2..072da90789ef3 100644 --- a/.github/templates/linux_binary_build_workflow.yml.j2 +++ b/.github/templates/linux_binary_build_workflow.yml.j2 @@ -78,7 +78,7 @@ jobs: !{{ upload.binary_env(config) }} steps: !{{ common.setup_rocm_linux() }} - - uses: !{{ common.download_artifact_s3_action }} + - uses: !{{ common.download_artifact_action }} name: Download Build Artifacts with: name: !{{ config["build_name"] }} diff --git a/.github/templates/windows_binary_build_workflow.yml.j2 b/.github/templates/windows_binary_build_workflow.yml.j2 index 6b0cbbd187403..9f68df06b704f 100644 --- a/.github/templates/windows_binary_build_workflow.yml.j2 +++ b/.github/templates/windows_binary_build_workflow.yml.j2 @@ -72,7 +72,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: !{{ common.upload_artifact_s3_action }} + - uses: !{{ common.upload_artifact_action }} if: always() with: name: !{{ config["build_name"] }} @@ -93,7 +93,7 @@ jobs: steps: !{{ common.setup_ec2_windows() }} !{{ set_runner_specific_vars() }} - - uses: !{{ common.download_artifact_s3_action }} + - uses: !{{ common.download_artifact_action }} name: Download Build Artifacts with: name: !{{ config["build_name"] }} diff --git a/.github/workflows/_android-full-build-test.yml b/.github/workflows/_android-full-build-test.yml index efc66846db7a3..02c6a9d890212 100644 --- a/.github/workflows/_android-full-build-test.yml +++ b/.github/workflows/_android-full-build-test.yml @@ -19,23 +19,6 @@ on: If this is set, our linter will use this to make sure that every other job with the same `sync-tag` is identical. 
- secrets: - SONATYPE_NEXUS_USERNAME: - description: nexus user - required: true - SONATYPE_NEXUS_PASSWORD: - description: nexus pass - required: true - ANDROID_SIGN_KEY: - description: android key - required: true - ANDROID_SIGN_PASS: - description: android pass - required: true - SCRIBE_GRAPHQL_ACCESS_TOKEN: - description: token for writing to scribe/scuba - required: true - env: GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} @@ -160,25 +143,6 @@ jobs: mkdir -p "${GITHUB_WORKSPACE}/build_android_artifacts" docker cp "${ID_X86_32}:/var/lib/jenkins/workspace/android/artifacts.tgz" "${GITHUB_WORKSPACE}/build_android_artifacts/" - - name: Publish android snapshot - if: ${{ github.event_name == 'push' && github.event.ref == 'refs/heads/nightly' }} - env: - SONATYPE_NEXUS_USERNAME: ${{ secrets.SONATYPE_NEXUS_USERNAME }} - SONATYPE_NEXUS_PASSWORD: ${{ secrets.SONATYPE_NEXUS_PASSWORD }} - ANDROID_SIGN_KEY: ${{ secrets.ANDROID_SIGN_KEY }} - ANDROID_SIGN_PASS: ${{ secrets.ANDROID_SIGN_PASS }} - ID_X86_32: ${{ steps.build-x86_32.outputs.container_id }} - run: | - set -eux - (echo "./.circleci/scripts/publish_android_snapshot.sh" | docker exec \ - -e BUILD_ENVIRONMENT="pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-publish-snapshot" \ - -e SONATYPE_NEXUS_USERNAME \ - -e SONATYPE_NEXUS_PASSWORD \ - -e ANDROID_SIGN_KEY \ - -e ANDROID_SIGN_PASS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - -u jenkins -i "${ID_X86_32}" bash) 2>&1 - - name: Store PyTorch Android Build Artifacts on S3 uses: seemethere/upload-artifact-s3@v5 with: diff --git a/.github/workflows/_binary-build-linux.yml b/.github/workflows/_binary-build-linux.yml index b1b88a5b32f80..dc69e3a82258f 100644 --- a/.github/workflows/_binary-build-linux.yml +++ b/.github/workflows/_binary-build-linux.yml @@ -63,7 +63,7 @@ on: jobs: build: runs-on: linux.4xlarge - timeout-minutes: 240 + timeout-minutes: 270 env: PYTORCH_ROOT: ${{ inputs.PYTORCH_ROOT }} BUILDER_ROOT: ${{ inputs.BUILDER_ROOT }} @@ -209,10 +209,9 @@ jobs: # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 with: name: ${{ inputs.build_name }} - retention-days: 14 if-no-files-found: error path: ${{ runner.temp }}/artifacts/* diff --git a/.github/workflows/_binary-test-linux.yml b/.github/workflows/_binary-test-linux.yml index 5c29288b82462..e8749c59d58c7 100644 --- a/.github/workflows/_binary-test-linux.yml +++ b/.github/workflows/_binary-test-linux.yml @@ -139,7 +139,7 @@ jobs: rm -rf "${GITHUB_WORKSPACE}" mkdir "${GITHUB_WORKSPACE}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: ${{ inputs.build_name }} @@ -172,7 +172,7 @@ jobs: working-directory: builder - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + uses: nick-fields/retry@7d4a37704547a311dbb66ebdf5b23ec19374a767 if: ${{ inputs.GPU_ARCH_TYPE == 'cuda' }} with: timeout_minutes: 10 diff --git a/.github/workflows/_binary-upload.yml b/.github/workflows/_binary-upload.yml index cf47de9ccf212..eddc3abc7f2db 100644 --- a/.github/workflows/_binary-upload.yml +++ b/.github/workflows/_binary-upload.yml @@ -70,7 +70,9 @@ on: description: Conda PyTorchBot token jobs: build: - runs-on: linux.2xlarge + runs-on: ubuntu-22.04 + container: + image: continuumio/miniconda3:4.12.0 env: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder @@ -86,40 +88,20 @@ jobs: LIBTORCH_VARIANT: ${{ inputs.LIBTORCH_VARIANT }} DESIRED_DEVTOOLSET: ${{ inputs.DESIRED_DEVTOOLSET }} DESIRED_PYTHON: ${{ inputs.DESIRED_PYTHON }} - # Needed for conda builds - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" ANACONDA_USER: pytorch - AWS_DEFAULT_REGION: us-east-1 BINARY_ENV_FILE: /tmp/env GITHUB_TOKEN: ${{ secrets.github-token }} PR_NUMBER: ${{ github.event.pull_request.number }} PYTORCH_FINAL_PACKAGE_DIR: /artifacts SHA1: ${{ github.event.pull_request.head.sha || github.sha }} steps: - - name: List the env - shell: bash - run: env - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - - name: Setup Linux - uses: ./.github/actions/setup-linux - - name: Chown workspace - uses: ./.github/actions/chown-workspace - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: ./.github/actions/setup-ssh with: - github-secret: ${{ secrets.github-token }} + no-sudo: true - - name: Download Build Artifacts with S3 - uses: seemethere/download-artifact-s3@v4 - if: ${{ inputs.use_s3 }} - with: - name: ${{ inputs.build_name }} - path: "${{ runner.temp }}/artifacts/" - - - name: Download Build Artifacts without S3 + - name: Download Build Artifacts uses: actions/download-artifact@v2 - if: ${{ !inputs.use_s3 }} with: name: ${{ inputs.build_name }} path: "${{ runner.temp }}/artifacts/" @@ -144,35 +126,4 @@ jobs: AWS_SECRET_ACCESS_KEY: ${{ secrets.aws-pytorch-uploader-secret-access-key }} ANACONDA_API_TOKEN: ${{ secrets.conda-pytorchbot-token }} run: | - docker run --rm -i \ - -e ANACONDA_API_TOKEN \ - -e AWS_ACCESS_KEY_ID \ - -e AWS_SECRET_ACCESS_KEY \ - -e DRY_RUN \ - -e PACKAGE_TYPE \ - -e PKG_DIR=/artifacts \ - -e UPLOAD_CHANNEL \ - -e UPLOAD_SUBFOLDER \ - -v "${RUNNER_TEMP}/artifacts:/artifacts" \ - -v "${GITHUB_WORKSPACE}:/v" \ - -w /v \ - 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ - bash -c '.circleci/scripts/binary_upload.sh' - - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: 
.github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af + bash .circleci/scripts/binary_upload.sh diff --git a/.github/workflows/_buck-build-test.yml b/.github/workflows/_buck-build-test.yml index ae7f7517e2eda..221ca9adcd442 100644 --- a/.github/workflows/_buck-build-test.yml +++ b/.github/workflows/_buck-build-test.yml @@ -28,7 +28,7 @@ jobs: activate-environment: build - name: Install dependencies - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + uses: nick-fields/retry@7d4a37704547a311dbb66ebdf5b23ec19374a767 with: timeout_minutes: 10 max_attempts: 5 @@ -46,16 +46,17 @@ jobs: typing_extensions - name: Install Buck - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + uses: nick-fields/retry@7d4a37704547a311dbb66ebdf5b23ec19374a767 with: timeout_minutes: 10 max_attempts: 5 command: | - wget https://github.com/facebook/buck/releases/download/v2021.01.12.01/buck.2021.01.12.01_all.deb + sudo apt update -q + wget -q https://github.com/facebook/buck/releases/download/v2021.01.12.01/buck.2021.01.12.01_all.deb sudo apt install ./buck.2021.01.12.01_all.deb - name: Download third party libraries and generate wrappers - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + uses: nick-fields/retry@7d4a37704547a311dbb66ebdf5b23ec19374a767 with: timeout_minutes: 10 max_attempts: 5 diff --git a/.github/workflows/_docs.yml b/.github/workflows/_docs.yml index de28790f8c5e9..a70605f2f5aa4 100644 --- a/.github/workflows/_docs.yml +++ b/.github/workflows/_docs.yml @@ -38,10 +38,16 @@ jobs: build-docs: # Don't run on forked repos. 
if: github.repository_owner == 'pytorch' - runs-on: [self-hosted, linux.2xlarge] + runs-on: [self-hosted, linux.4xlarge] strategy: matrix: - docs_type: [cpp, python] + include: + - docs_type: cpp + # Nightly cpp docs take about 150m to finish, and the number is stable + timeout-minutes: 180 + - docs_type: python + # It takes less than 30m to finish python docs unless there are issues + timeout-minutes: 30 steps: # [see note: pytorch repo ref] - name: Checkout PyTorch @@ -76,6 +82,8 @@ jobs: echo "password ${GITHUB_PYTORCHBOT_TOKEN}" >> "${RUNNER_TEMP}/.netrc" - name: Build ${{ matrix.docs_type }} docs + timeout-minutes: ${{ matrix.timeout-minutes }} + id: build-docs env: WITH_PUSH: ${{ github.event_name == 'schedule' || startsWith(github.event.ref, 'refs/tags/v') }} DOCKER_IMAGE: ${{ inputs.docker-image }} @@ -118,7 +126,7 @@ jobs: - name: Upload Python Docs Preview uses: seemethere/upload-artifact-s3@v5 - if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'python' }} + if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'python' && steps.build-docs.outcome == 'success' }} with: retention-days: 14 s3-bucket: doc-previews @@ -128,7 +136,7 @@ jobs: - name: Upload C++ Docs Preview uses: seemethere/upload-artifact-s3@v5 - if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'cpp' }} + if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'cpp' && steps.build-docs.outcome == 'success' }} with: retention-days: 14 if-no-files-found: error diff --git a/.github/workflows/_ios-build-test.yml b/.github/workflows/_ios-build-test.yml index 189e21d210e59..56443419ef1d8 100644 --- a/.github/workflows/_ios-build-test.yml +++ b/.github/workflows/_ios-build-test.yml @@ -140,6 +140,7 @@ jobs: scripts/build_ios.sh - name: Run Build Test + timeout-minutes: 5 run: | PROFILE=PyTorch_CI_2022 # run the ruby build script @@ -190,3 +191,9 @@ jobs: else bundle exec fastlane scan --only_testing TestAppTests/TestAppTests/testFullJIT fi + - name: Dump Simulator Tests On a Failure + if: | + failure() && inputs.ios-platform == 'SIMULATOR' + run: | + echo "Simulator Tests Logs:" + cat /Users/runner/Library/Logs/scan/*.log diff --git a/.github/workflows/_linux-test.yml b/.github/workflows/_linux-test.yml index aa81647c53fcf..f4cd8376883eb 100644 --- a/.github/workflows/_linux-test.yml +++ b/.github/workflows/_linux-test.yml @@ -53,7 +53,7 @@ jobs: docker-image: ${{ inputs.docker-image }} - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + uses: nick-fields/retry@7d4a37704547a311dbb66ebdf5b23ec19374a767 if: contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') with: timeout_minutes: 10 @@ -178,6 +178,7 @@ jobs: - name: Stop monitoring script if: always() && steps.monitor-script.outputs.monitor-script-pid shell: bash + continue-on-error: true env: MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }} run: | diff --git a/.github/workflows/_mac-build.yml b/.github/workflows/_mac-build.yml index f17bd649c7131..316656b6ec9b2 100644 --- a/.github/workflows/_mac-build.yml +++ b/.github/workflows/_mac-build.yml @@ -27,6 +27,12 @@ on: description: | If this is set, our linter will use this to make sure that every other job with the same `sync-tag` is identical. + python_version: + required: false + type: string + default: "3.8" + description: | + The python version to be used. 
Will be 3.8 by default secrets: MACOS_SCCACHE_S3_ACCESS_KEY_ID: @@ -68,7 +74,7 @@ jobs: uses: conda-incubator/setup-miniconda@v2 with: auto-update-conda: true - python-version: 3.8 + python-version: ${{ inputs.python_version }} activate-environment: build miniconda-version: 4.7.12 diff --git a/.github/workflows/_mac-test-arm64.yml b/.github/workflows/_mac-test-mps.yml similarity index 98% rename from .github/workflows/_mac-test-arm64.yml rename to .github/workflows/_mac-test-mps.yml index 14502a32ad684..fa189307358a6 100644 --- a/.github/workflows/_mac-test-arm64.yml +++ b/.github/workflows/_mac-test-mps.yml @@ -41,7 +41,7 @@ jobs: - name: Install PyTorch env: ENV_NAME: conda-test-env-${{ github.run_id }} - PY_VERS: 3.8 + PY_VERS: 3.9 shell: arch -arch arm64 bash {0} run: | # shellcheck disable=SC1090 diff --git a/.github/workflows/_mac-test.yml b/.github/workflows/_mac-test.yml index e919bef85a67a..36a0149795dc7 100644 --- a/.github/workflows/_mac-test.yml +++ b/.github/workflows/_mac-test.yml @@ -18,6 +18,11 @@ on: description: | If this is set, our linter will use this to make sure that every other job with the same `sync-tag` is identical. + arch: + required: true + type: string + description: | + Contains the architecture to run the tests with secrets: AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID: @@ -27,15 +32,15 @@ on: required: true description: secret acess key for test stats upload -# For setup-miniconda, see https://github.com/conda-incubator/setup-miniconda/issues/179 -defaults: - run: - shell: bash -e -l {0} - jobs: test: # Don't run on forked repos. if: github.repository_owner == 'pytorch' + # For setup-miniconda, see https://github.com/conda-incubator/setup-miniconda/issues/179 + # Also ensure that we always run with the right architecture + defaults: + run: + shell: arch -arch ${{ inputs.arch }} bash -e -l {0} strategy: matrix: ${{ fromJSON(inputs.test-matrix) }} fail-fast: false @@ -57,7 +62,6 @@ jobs: - name: Start monitoring script id: monitor-script - shell: bash run: | python3 -m pip install psutil==5.9.1 python3 -m pip install pynvml==11.4.1 @@ -70,7 +74,8 @@ jobs: name: ${{ inputs.build-environment }} use-gha: true - - name: Setup miniconda + - name: Setup miniconda for x86 + if: inputs.build-environment == 'macos-12-py3-x86-64' uses: conda-incubator/setup-miniconda@v2 with: auto-update-conda: true @@ -78,6 +83,16 @@ jobs: activate-environment: build miniconda-version: 4.7.12 + - name: Setup miniconda for arm64 + if: inputs.build-environment == 'macos-12-py3-arm64' + run: | + # Conda is already installed and setup for bash here + # Cleanup lingering conda environment and create + # a new one for this run + conda env remove -n build + conda create -n build python=3.9.12 + conda list + - name: Install macOS homebrew dependencies run: | # Install dependencies @@ -87,6 +102,12 @@ jobs: id: parse-ref run: .github/scripts/parse_ref.py + - name: Pre-process arm64 wheels + if: inputs.build-environment == 'macos-12-py3-arm64' + run: | + # As wheels are cross-compiled they are reported as x86_64 ones + ORIG_WHLNAME=$(ls -1 dist/*.whl); ARM_WHLNAME=${ORIG_WHLNAME/x86_64/arm64}; mv "${ORIG_WHLNAME}" "${ARM_WHLNAME}" + - name: Test id: test run: | @@ -103,10 +124,21 @@ jobs: # wreak havoc internally export COMMIT_MESSAGES="${COMMIT_MESSAGES//[\'\"]}" export PR_BODY="${PR_BODY//[\'\"]}" + arch + + # This is a no-op for x86 + conda activate build python3 -mpip install dist/*.whl .jenkins/pytorch/macos-test.sh + - name: Cleanup miniconda for arm64 + if: inputs.build-environment == 
'macos-12-py3-arm64' + run: | + # Cleanup conda env + conda deactivate + conda env remove -n build + - name: Get workflow job id id: get-job-id uses: ./.github/actions/get-workflow-job-id @@ -115,8 +147,8 @@ jobs: github-token: ${{ secrets.GITHUB_TOKEN }} - name: Stop monitoring script - if: always() && steps.monitor-script.outputs.monitor-script-pid - shell: bash + if: always() && ${{ steps.monitor-script.outputs.monitor-script-pid }} + continue-on-error: true env: MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }} run: | @@ -148,7 +180,6 @@ jobs: AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} GHA_WORKFLOW_JOB_ID: ${{ steps.get-job-id.outputs.job-id }} - shell: bash run: | set -x python3 -m pip install -r requirements.txt diff --git a/.github/workflows/_rocm-test.yml b/.github/workflows/_rocm-test.yml index b5550fdda7f0a..f65e1464998a3 100644 --- a/.github/workflows/_rocm-test.yml +++ b/.github/workflows/_rocm-test.yml @@ -179,6 +179,7 @@ jobs: - name: Stop monitoring script if: always() && steps.monitor-script.outputs.monitor-script-pid shell: bash + continue-on-error: true env: MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }} run: | diff --git a/.github/workflows/_win-test.yml b/.github/workflows/_win-test.yml index 560c0fe84e1d4..243bd7563639a 100644 --- a/.github/workflows/_win-test.yml +++ b/.github/workflows/_win-test.yml @@ -124,6 +124,7 @@ jobs: - name: Stop monitoring script if: always() && steps.monitor-script.outputs.monitor-script-pid shell: bash + continue-on-error: true env: MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }} run: | diff --git a/.github/workflows/cancel_redundant_workflows.yml b/.github/workflows/cancel_redundant_workflows.yml deleted file mode 100644 index c6755dd25f37d..0000000000000 --- a/.github/workflows/cancel_redundant_workflows.yml +++ /dev/null @@ -1,23 +0,0 @@ -name: Cancel redundant workflows -on: - workflow_run: - types: - - requested - # NOTE: Make sure to add to this list as you add more workflows running on 'pull_request' - workflows: - - Lint - - Test tools - - TorchBench CI (pytorch-linux-py3.7-cu102) - - clang-format -jobs: - cancel: - # We do not want to cancel reruns on master - if: github.event.workflow_run.head_branch != 'master' - runs-on: ubuntu-18.04 - steps: - - name: Cancel duplicate workflow runs - uses: potiuk/cancel-workflow-runs@a81b3c4d59c61e27484cfacdc13897dd908419c9 - with: - cancelMode: duplicates - token: ${{ secrets.GITHUB_TOKEN }} - sourceRunId: ${{ github.event.workflow_run.id }} diff --git a/.github/workflows/docker-release.yml b/.github/workflows/docker-release.yml new file mode 100644 index 0000000000000..6c4dbe5ef773d --- /dev/null +++ b/.github/workflows/docker-release.yml @@ -0,0 +1,89 @@ +name: Build Official Docker Images + +on: + workflow_dispatch: + pull_request: + paths: + - Dockerfile + - docker.Makefile + push: + branches: + - nightly + tags: + # Release candidate tags look like: v1.11.0-rc1 + - v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+ + - ciflow/nightly/* + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +env: + BUILD_PROGRESS: plain + BUILD_TYPE: official + DOCKER_ORG: pytorch + DOCKER_REGISTRY: ghcr.io + NO_BUILD_SUFFIX: true + USE_BUILDX: 1 + WITH_PUSH: ${{ github.event_name == 'push' && (github.event.ref == 
'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + +jobs: + build: + if: ${{ github.repository == 'pytorch/pytorch' }} + runs-on: [self-hosted, linux.2xlarge] + timeout-minutes: 240 + strategy: + matrix: + include: + # nvidia specific images don't exist for arm64 so only build the runtime image + - image_type: runtime + platform: linux/arm64,linux/amd64 + - image_type: devel + platform: linux/amd64 + env: + BUILD_IMAGE_TYPE: ${{ matrix.image_type }} + BUILD_PLATFORMS: ${{ matrix.platform }} + steps: + # [see note: pytorch repo ref] + # deep clone (fetch-depth 0) required for git merge-base + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Setup SSH (Click me for login details) + uses: ./.github/actions/setup-ssh + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + - name: Login to GitHub Container Registry + if: ${{ env.WITH_PUSH == 'true' }} + uses: docker/login-action@v2 + with: + registry: ghcr.io + username: pytorch + password: ${{ secrets.GHCR_PAT }} + # Setup multi-arch image builds + - name: Set up QEMU + uses: docker/setup-qemu-action@v2 + env: + QEMU_BINARY_PATH: ${{ runner.temp }}/bin + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v2 + - name: Setup job specific variables + run: | + set -eou pipefail + # To get QEMU binaries in our PATh + echo "${RUNNER_TEMP}/bin" >> "${GITHUB_PATH}" + # Generate PyTorch version to use + echo "PYTORCH_VERSION=$(python3 .github/scripts/generate_pytorch_version.py)" >> "${GITHUB_ENV}" + - name: Setup nightly specific variables + if: ${{ github.event.ref == 'refs/heads/nightly' }} + run: | + # Use nightly image if building for nightly + echo "DOCKER_IMAGE=pytorch-nightly" >> "${GITHUB_ENV}" + - name: Run docker build / push + # WITH_PUSH is used here to determine whether or not to add the --push flag + run: | + make -f docker.Makefile "${BUILD_IMAGE_TYPE}-image" + - name: Teardown Linux + uses: ./.github/actions/teardown-linux + if: always() diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml index 002f25561c358..f283a6b5c3a8d 100644 --- a/.github/workflows/lint.yml +++ b/.github/workflows/lint.yml @@ -14,19 +14,23 @@ jobs: lintrunner: runs-on: ubuntu-18.04 steps: + - name: Checkout PyTorch + uses: csarofeen/pytorch/.github/actions/checkout-pytorch@master + with: + submodules: false + fetch-depth: 1 + - name: Setup Python uses: actions/setup-python@v2 with: python-version: 3.8 architecture: x64 - - - name: Checkout PyTorch - uses: csarofeen/pytorch/.github/actions/checkout-pytorch@master - with: - submodules: false + cache: pip + cache-dependency-path: | + **/.github/requirements-gha-cache.txt - name: Install lintrunner - run: pip install lintrunner==0.9.* + run: pip install lintrunner==0.9.2 - name: Initialize lint dependencies run: lintrunner init diff --git a/.github/workflows/mac-mps.yml b/.github/workflows/mac-mps.yml new file mode 100644 index 0000000000000..8fc2dd8336bff --- /dev/null +++ b/.github/workflows/mac-mps.yml @@ -0,0 +1,35 @@ +name: Mac MPS + +on: + push: + tags: + - ciflow/mps/* + workflow_dispatch: + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + macos-12-py3-arm64-build: + name: macos-12-py3-arm64 + uses: 
./.github/workflows/_mac-build.yml + with: + sync-tag: macos-12-py3-arm64-build + build-environment: macos-12-py3-arm64 + xcode-version: "13.3.1" + runner-type: macos-12-xl + build-generates-artifacts: true + # To match the one pre-installed in the m1 runners + python_version: 3.9.12 + secrets: + MACOS_SCCACHE_S3_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} + MACOS_SCCACHE_S3_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} + + macos-12-py3-arm64-mps-test: + name: macos-12-py3-arm64-mps + uses: ./.github/workflows/_mac-test-mps.yml + needs: macos-12-py3-arm64-build + with: + sync-tag: macos-12-py3-arm64-mps-test + build-environment: macos-12-py3-arm64 diff --git a/.github/workflows/periodic.yml b/.github/workflows/periodic.yml index 0e3e565deb914..7fbd04f8f161f 100644 --- a/.github/workflows/periodic.yml +++ b/.github/workflows/periodic.yml @@ -120,6 +120,58 @@ jobs: { config: "default", shard: 4, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, ]} + linux-bionic-cuda11_7-py3_7-gcc7-debug-build: + name: linux-bionic-cuda11.7-py3.7-gcc7-debug + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-cuda11.7-py3.7-gcc7-debug + docker-image-name: pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7 + build-with-debug: true + + linux-bionic-cuda11_7-py3_7-gcc7-debug-test: + name: linux-bionic-cuda11.7-py3.7-gcc7-debug + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda11_7-py3_7-gcc7-debug-build + with: + build-environment: linux-bionic-cuda11.7-py3.7-gcc7-debug + docker-image: ${{ needs.linux-bionic-cuda11_7-py3_7-gcc7-debug-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 3, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 4, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + ]} + + libtorch-linux-bionic-cuda11_7-py3_7-gcc7-build: + name: libtorch-linux-bionic-cuda11.7-py3.7-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: libtorch-linux-bionic-cuda11.7-py3.7-gcc7 + docker-image-name: pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7 + build-generates-artifacts: false + + win-vs2019-cuda11_7-py3-build: + name: win-vs2019-cuda11.7-py3 + uses: ./.github/workflows/_win-build.yml + with: + build-environment: win-vs2019-cuda11.7-py3 + cuda-version: "11.7" + + win-vs2019-cuda11_7-py3-test: + name: win-vs2019-cuda11.7-py3 + uses: ./.github/workflows/_win-test.yml + needs: win-vs2019-cuda11_7-py3-build + with: + build-environment: win-vs2019-cuda11.7-py3 + cuda-version: "11.7" + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 2, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "force_on_cpu", shard: 1, num_shards: 1, runner: "windows.4xlarge" }, + ]} + ios-12-5-1-x86-64-coreml: name: ios-12-5-1-x86-64-coreml uses: ./.github/workflows/_ios-build-test.yml diff --git a/.github/workflows/pr-labels.yml b/.github/workflows/pr-labels.yml index 7313d0b8e9682..41c91b05b1c29 100644 --- a/.github/workflows/pr-labels.yml +++ b/.github/workflows/pr-labels.yml @@ -11,14 +11,18 @@ jobs: runs-on: ubuntu-latest steps: + - name: Checkout repository + uses: actions/checkout@v2 + - name: Set up python uses: actions/setup-python@v2 + with: + cache: 
pip + cache-dependency-path: | + **/.github/requirements-gha-cache.txt - name: Install requests - run: pip3 install requests==2.26 - - - name: Checkout repository - uses: actions/checkout@v2 + run: pip install requests==2.26 - name: Process commit and find merger responsible for labeling id: commit diff --git a/.github/workflows/pull.yml b/.github/workflows/pull.yml new file mode 100644 index 0000000000000..b5d545844fc5b --- /dev/null +++ b/.github/workflows/pull.yml @@ -0,0 +1,312 @@ +name: pull + +on: + pull_request: + push: + branches: + - master + - main + - release/* + - landchecks/* + workflow_dispatch: + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + linux-focal-py3_7-gcc7-build: + name: linux-focal-py3.7-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-py3.7-gcc7 + docker-image-name: pytorch-linux-focal-py3.7-gcc7 + + linux-focal-py3_7-gcc7-test: + name: linux-focal-py3.7-gcc7 + uses: ./.github/workflows/_linux-test.yml + needs: linux-focal-py3_7-gcc7-build + with: + build-environment: linux-focal-py3.7-gcc7 + docker-image: ${{ needs.linux-focal-py3_7-gcc7-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + { config: "distributed", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + { config: "functorch", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + { config: "docs_test", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + { config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + { config: "backwards_compat", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} + + linux-docs: + name: linux-docs + uses: ./.github/workflows/_docs.yml + needs: linux-focal-py3_7-gcc7-build + with: + build-environment: linux-focal-py3.7-gcc7 + docker-image: ${{ needs.linux-focal-py3_7-gcc7-build.outputs.docker-image }} + + linux-focal-py3_7-gcc7-no-ops: + name: linux-focal-py3.7-gcc7-no-ops + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-py3.7-gcc7-no-ops + docker-image-name: pytorch-linux-focal-py3.7-gcc7 + + linux-focal-py3_7-gcc7-pch: + name: linux-focal-py3.7-gcc7-pch + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-py3.7-gcc7-pch + docker-image-name: pytorch-linux-focal-py3.7-gcc7 + + linux-focal-py3_7-clang7-asan-build: + name: linux-focal-py3.7-clang7-asan + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-py3.7-clang7-asan + docker-image-name: pytorch-linux-focal-py3-clang7-asan + + linux-focal-py3_7-clang7-asan-test: + name: linux-focal-py3.7-clang7-asan + uses: ./.github/workflows/_linux-test.yml + needs: linux-focal-py3_7-clang7-asan-build + with: + build-environment: linux-focal-py3.7-clang7-asan + docker-image: ${{ needs.linux-focal-py3_7-clang7-asan-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 5, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 5, runner: "linux.2xlarge" }, + { config: "default", shard: 3, num_shards: 5, runner: "linux.2xlarge" }, + { config: "default", shard: 4, num_shards: 5, runner: "linux.2xlarge" }, + { config: "default", shard: 5, num_shards: 5, runner: "linux.2xlarge" }, + ]} + + 
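Each `test-matrix` block above is a JSON document that the reusable test workflow presumably expands into one job per `include` entry. A minimal sketch of what a single entry boils down to on the runner, assuming the shard numbers are handed through to `.jenkins/pytorch/test.sh` (the exported variable is illustrative):

```
# One matrix entry, e.g. { config: "default", shard: 3, num_shards: 5 },
# roughly corresponds to a sharded invocation like this:
export NUM_TEST_SHARDS=5   # illustrative; normally provided by the test workflow
python test/run_test.py --exclude-jit-executor --exclude-distributed-tests \
    --shard 3 "$NUM_TEST_SHARDS" --verbose
```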
linux-focal-py3_7-clang10-onnx-build: + name: linux-focal-py3.7-clang10-onnx + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-py3.7-clang10-onnx + docker-image-name: pytorch-linux-focal-py3-clang10-onnx + + linux-focal-py3_7-clang10-onnx-test: + name: linux-focal-py3.7-clang10-onnx + uses: ./.github/workflows/_linux-test.yml + needs: linux-focal-py3_7-clang10-onnx-build + with: + build-environment: linux-focal-py3.7-clang10-onnx + docker-image: ${{ needs.linux-focal-py3_7-clang10-onnx-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + ]} + + linux-bionic-py3_7-clang9-build: + name: linux-bionic-py3.7-clang9 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-py3.7-clang9 + docker-image-name: pytorch-linux-bionic-py3.7-clang9 + + linux-bionic-py3_7-clang9-test: + name: linux-bionic-py3.7-clang9 + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-py3_7-clang9-build + with: + build-environment: linux-bionic-py3.7-clang9 + docker-image: ${{ needs.linux-bionic-py3_7-clang9-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + { config: "crossref", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "crossref", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + { config: "dynamo", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "dynamo", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + { config: "functorch", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} + + linux-bionic-cuda11_3-py3_7-clang9-build: + name: linux-bionic-cuda11.3-py3.7-clang9 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-cuda11.3-py3.7-clang9 + docker-image-name: pytorch-linux-bionic-cuda11.3-cudnn8-py3-clang9 + + linux-vulkan-bionic-py3_7-clang9-build: + name: linux-vulkan-bionic-py3.7-clang9 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-vulkan-bionic-py3.7-clang9 + docker-image-name: pytorch-linux-bionic-py3.7-clang9 + + linux-vulkan-bionic-py3_7-clang9-test: + name: linux-vulkan-bionic-py3.7-clang9 + uses: ./.github/workflows/_linux-test.yml + needs: linux-vulkan-bionic-py3_7-clang9-build + with: + build-environment: linux-vulkan-bionic-py3.7-clang9 + docker-image: ${{ needs.linux-vulkan-bionic-py3_7-clang9-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} + + linux-bionic-cuda11_6-py3_10-gcc7-build: + name: linux-bionic-cuda11.6-py3.10-gcc7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-cuda11.6-py3.10-gcc7 + docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 + + linux-bionic-cuda11_6-py3_10-gcc7-test: + name: linux-bionic-cuda11.6-py3.10-gcc7 + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda11_6-py3_10-gcc7-build + with: + build-environment: linux-bionic-cuda11.6-py3.10-gcc7 + docker-image: ${{ needs.linux-bionic-cuda11_6-py3_10-gcc7-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 4, runner: 
"linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 3, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 4, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "distributed", shard: 1, num_shards: 2, runner: "linux.8xlarge.nvidia.gpu" }, + { config: "distributed", shard: 2, num_shards: 2, runner: "linux.8xlarge.nvidia.gpu" }, + { config: "functorch", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" }, + ]} + + linux-xenial-py3-clang5-mobile-build: + name: linux-xenial-py3-clang5-mobile-build + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-py3-clang5-mobile-build + docker-image-name: pytorch-linux-xenial-py3-clang5-asan + build-generates-artifacts: false + + linux-jammy-cuda-11_6-cudnn8-py3_8-clang12-build: + name: linux-jammy-cuda11.6-cudnn8-py3.8-clang12 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-jammy-cuda11.6-cudnn8-py3.8-clang12 + docker-image-name: pytorch-linux-jammy-cuda11.6-cudnn8-py3.8-clang12 + + linux-xenial-py3-clang5-mobile-custom-build-static: + name: linux-xenial-py3-clang5-mobile-custom-build-static + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-xenial-py3-clang5-mobile-custom-build-static + docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c + build-generates-artifacts: false + + linux-bionic-py3_7-clang8-xla-build: + name: linux-bionic-py3_7-clang8-xla + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-py3_7-clang8-xla + docker-image-name: xla_base + + linux-bionic-py3_7-clang8-xla-test: + name: linux-bionic-py3_7-clang8-xla + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-py3_7-clang8-xla-build + with: + build-environment: linux-bionic-py3_7-clang8-xla + docker-image: ${{ needs.linux-bionic-py3_7-clang8-xla-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "xla", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} + + win-vs2019-cpu-py3-build: + name: win-vs2019-cpu-py3 + uses: ./.github/workflows/_win-build.yml + with: + build-environment: win-vs2019-cpu-py3 + cuda-version: cpu + + win-vs2019-cpu-py3-test: + name: win-vs2019-cpu-py3 + uses: ./.github/workflows/_win-test.yml + needs: win-vs2019-cpu-py3-build + with: + build-environment: win-vs2019-cpu-py3 + cuda-version: cpu + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "windows.4xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "windows.4xlarge" }, + { config: "functorch", shard: 1, num_shards: 1, runner: "windows.4xlarge" }, + ]} + + win-vs2019-cuda11_6-py3-build: + if: github.event_name == 'pull_request' + name: win-vs2019-cuda11.6-py3 + uses: ./.github/workflows/_win-build.yml + with: + build-environment: win-vs2019-cuda11.6-py3 + cuda-version: "11.6" + sync-tag: win-cuda-build + + linux-xenial-cuda11_3-py3_7-gcc7-bazel-test: + name: linux-xenial-cuda11.3-py3.7-gcc7-bazel-test + uses: ./.github/workflows/_bazel-build-test.yml + with: + build-environment: linux-xenial-cuda11.3-py3.7-gcc7-bazel-test + docker-image-name: pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 + + linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single: + name: linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single + uses: ./.github/workflows/_android-build-test.yml + with: + build-environment: linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single + docker-image-name: 
pytorch-linux-xenial-py3-clang5-android-ndk-r19c + + linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit: + name: linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit + uses: ./.github/workflows/_android-build-test.yml + with: + build-environment: linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit + docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c + + linux-focal-py3_7-gcc7-mobile-lightweight-dispatch-build: + name: linux-focal-py3.7-gcc7-mobile-lightweight-dispatch-build + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-py3.7-gcc7-mobile-lightweight-dispatch-build + docker-image-name: pytorch-linux-focal-py3.7-gcc7 + build-generates-artifacts: false + + linux-bionic-cuda11_6-py3_10-gcc7-deploy-build: + name: linux-bionic-cuda11_6-py3_10-gcc7-deploy + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-cuda11.6-py3.10-gcc7-deploy + docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 + + deploy-linux-bionic-cuda11_6-py3_10-gcc7-test: + name: linux-bionic-cuda11_6-py3_10-gcc7-deploy + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda11_6-py3_10-gcc7-deploy-build + with: + build-environment: linux-bionic-cuda11.6-py3.10-gcc7-deploy + docker-image: ${{ needs.linux-bionic-cuda11_6-py3_10-gcc7-deploy-build.outputs.docker-image }} + test-matrix: | + { include: [ + { config: "deploy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" }, + ]} + + linux-focal-rocm5_2-py3_7-build: + # don't run build twice on master + if: github.event_name == 'pull_request' + name: linux-focal-rocm5.2-py3.7 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-rocm5.2-py3.7 + docker-image-name: pytorch-linux-focal-rocm5.2-py3.7 + sync-tag: rocm-build diff --git a/.github/workflows/push_nightly_docker_ghcr.yml b/.github/workflows/push_nightly_docker_ghcr.yml new file mode 100644 index 0000000000000..bdcc6e05dc593 --- /dev/null +++ b/.github/workflows/push_nightly_docker_ghcr.yml @@ -0,0 +1,39 @@ +name: docker-release-builds +on: + schedule: + # Push the nightly docker daily at 1 PM UTC + - cron: '0 13 * * *' + # Trigger when we modify something related to these images + pull_request: + paths: + - .github/scripts/build_publish_nightly_docker.sh + - .github/workflows/push_nightly_docker_ghcr.yml + - Dockerfile + - docker.Makefile + # Have the ability to trigger this job manually using the API as well + workflow_dispatch: + +jobs: + docker-release-build: + if: ${{ github.repository == 'pytorch/pytorch' }} + runs-on: linux.2xlarge + env: + GHCR_PAT: ${{ secrets.GHCR_PAT }} + WITH_PUSH: ${{ github.event_name == 'schedule' }} + steps: + - name: Checkout PyTorch + uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 + with: + ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} + - uses: nick-fields/retry@7d4a37704547a311dbb66ebdf5b23ec19374a767 + name: Build and upload nightly docker + with: + timeout_minutes: 10 + max_attempts: 3 + command: | + set -ex + bash .github/scripts/build_publish_nightly_docker.sh + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true diff --git a/.github/workflows/revert.yml b/.github/workflows/revert.yml index 1fbdacc82071e..d207840f383b4 100644 --- a/.github/workflows/revert.yml 
+++ b/.github/workflows/revert.yml @@ -8,18 +8,24 @@ jobs: do_revert: name: try_revert_pr_${{ github.event.client_payload.pr_num }} runs-on: linux.20_04.4x + env: + GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 - name: Checkout repo uses: actions/checkout@v2 + id: checkout with: fetch-depth: 0 token: ${{ secrets.MERGEBOT_TOKEN }} + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: 3.8 + architecture: x64 + cache: 'pip' + - run: pip install pyyaml==6.0 + - name: Setup committer id run: | git config --global user.email "pytorchmergebot@users.noreply.github.com" @@ -30,7 +36,6 @@ jobs: PR_NUM: ${{ github.event.client_payload.pr_num }} COMMENT_ID: ${{ github.event.client_payload.comment_id }} REASON: ${{ github.event.client_payload.reason }} - GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} run: | set -ex if [ -n "${COMMENT_ID}" ]; then @@ -46,5 +51,14 @@ jobs: python3 .github/scripts/trymerge.py --revert "${PR_NUM}" fi fi + - name: Comment on Canceled + if: ${{ cancelled() && steps.checkout.outcome == 'success' }} + continue-on-error: true + env: + GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} + PR_NUM: ${{ github.event.client_payload.pr_num }} + run: | + set -ex + python3 .github/scripts/comment_on_pr.py "${PR_NUM}" "revert" concurrency: try-revert diff --git a/.github/workflows/stale_pull_requests.yml b/.github/workflows/stale_pull_requests.yml deleted file mode 100644 index a65e52c27a7c4..0000000000000 --- a/.github/workflows/stale_pull_requests.yml +++ /dev/null @@ -1,42 +0,0 @@ -name: 'Close stale pull requests' -on: - schedule: - # TODO: Reduce frequency once we work through the backlog of pull requests - - cron: '0 * * * *' - workflow_dispatch: - -jobs: - stale: - if: ${{ github.repository == 'pytorch/pytorch' }} - runs-on: ubuntu-18.04 - steps: - - uses: actions/stale@v4.1.0 - with: - stale-pr-message: > - Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`.
- Feel free to remove the `Stale` label if you feel this was a mistake. <br>
- `Stale` pull requests will automatically be closed 30 days after being marked `Stale` <br>
- exempt-pr-labels: "no-stale,open source,high priority" - days-before-pr-stale: 60 - days-before-pr-close: 90 - days-before-issue-stale: -1 - days-before-issue-close: -1 - ascending: true - stale-open-source: - if: ${{ github.repository == 'pytorch/pytorch' }} - runs-on: ubuntu-18.04 - steps: - - uses: actions/stale@v4.1.0 - with: - stale-pr-message: > - Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`. <br>
- Feel free to remove the `Stale` label if you feel this was a mistake. <br>
- If you are unable to remove the `Stale` label please contact a maintainer in order to do so. <br>
- `Stale` pull requests will automatically be closed 30 days after being marked `Stale` <br>
- exempt-pr-labels: "no-stale,high priority" - only-labels: "open source" - days-before-pr-stale: 60 - days-before-pr-close: 90 - days-before-issue-stale: -1 - days-before-issue-close: -1 - ascending: true diff --git a/.github/workflows/trunk.yml b/.github/workflows/trunk.yml index a31111ecf885f..c4298bcb7acae 100644 --- a/.github/workflows/trunk.yml +++ b/.github/workflows/trunk.yml @@ -60,8 +60,10 @@ jobs: docker-image: ${{ needs.linux-bionic-cuda10_2-py3_9-gcc7-build.outputs.docker-image }} test-matrix: | { include: [ - { config: "default", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, - { config: "default", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 1, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 3, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 4, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, { config: "functorch", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" }, { config: "slow", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, { config: "slow", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, @@ -94,12 +96,6 @@ jobs: with: build-environment: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c - secrets: - SONATYPE_NEXUS_USERNAME: ${{ secrets.SONATYPE_NEXUS_USERNAME }} - SONATYPE_NEXUS_PASSWORD: ${{ secrets.SONATYPE_NEXUS_PASSWORD }} - ANDROID_SIGN_KEY: ${{ secrets.ANDROID_SIGN_KEY }} - ANDROID_SIGN_PASS: ${{ secrets.ANDROID_SIGN_PASS }} - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} linux-bionic-py3_7-clang9-slow-build: name: linux-bionic-py3.7-clang9-slow @@ -157,6 +153,7 @@ jobs: { config: "default", shard: 2, num_shards: 2, runner: "macos-12" }, { config: "functorch", shard: 1, num_shards: 1, runner: "macos-12" }, ]} + arch: x86_64 secrets: AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} @@ -177,20 +174,40 @@ jobs: name: macos-12-py3-arm64 uses: ./.github/workflows/_mac-build.yml with: + sync-tag: macos-12-py3-arm64-build build-environment: macos-12-py3-arm64 xcode-version: "13.3.1" runner-type: macos-12-xl build-generates-artifacts: true + # To match the one pre-installed in the m1 runners + python_version: 3.9.12 secrets: MACOS_SCCACHE_S3_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} MACOS_SCCACHE_S3_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} macos-12-py3-arm64-mps-test: + name: macos-12-py3-arm64-mps + uses: ./.github/workflows/_mac-test-mps.yml + needs: macos-12-py3-arm64-build + with: + sync-tag: macos-12-py3-arm64-mps-test + build-environment: macos-12-py3-arm64 + + macos-12-py3-arm64-test: name: macos-12-py3-arm64 - uses: ./.github/workflows/_mac-test-arm64.yml + uses: ./.github/workflows/_mac-test.yml needs: macos-12-py3-arm64-build with: build-environment: macos-12-py3-arm64 + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "macos-m1-12" }, + { config: "default", shard: 2, num_shards: 2, runner: "macos-m1-12" }, + ]} + arch: arm64 + secrets: + AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} + AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY: ${{ 
secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} win-vs2019-cuda11_6-py3-build: name: win-vs2019-cuda11.6-py3 diff --git a/.github/workflows/trymerge.yml b/.github/workflows/trymerge.yml index 8db7b0c97c5c9..9ba29af660023 100644 --- a/.github/workflows/trymerge.yml +++ b/.github/workflows/trymerge.yml @@ -8,18 +8,24 @@ jobs: do_merge: name: try_merge_pr_${{ github.event.client_payload.pr_num }} runs-on: linux.20_04.4x + env: + GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 - name: Checkout repo + id: checkout uses: actions/checkout@v2 with: fetch-depth: 0 token: ${{ secrets.MERGEBOT_TOKEN }} + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: 3.8 + cache: 'pip' + architecture: x64 + - run: pip install pyyaml==6.0 + - name: Setup committer id run: | git config --global user.email "pytorchmergebot@users.noreply.github.com" @@ -28,7 +34,6 @@ jobs: env: GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} PR_NUM: ${{ github.event.client_payload.pr_num }} - GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} FORCE: ${{ github.event.client_payload.force}} ON_GREEN: ${{ github.event.client_payload.on_green}} LAND_CHECKS: ${{ github.event.client_payload.land_checks }} @@ -50,6 +55,15 @@ jobs: else python3 .github/scripts/trymerge.py "${PR_NUM}" fi + - name: Comment on Canceled + if: ${{ cancelled() && steps.checkout.outcome == 'success' }} + continue-on-error: true + env: + GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} + PR_NUM: ${{ github.event.client_payload.pr_num }} + run: | + set -ex + python3 .github/scripts/comment_on_pr.py "${PR_NUM}" "merge" # We want newer merge commands to supercede old ones concurrency: diff --git a/.github/workflows/tryrebase.yml b/.github/workflows/tryrebase.yml index 748127ff2d626..fed9000c420e9 100644 --- a/.github/workflows/tryrebase.yml +++ b/.github/workflows/tryrebase.yml @@ -7,19 +7,24 @@ on: jobs: do_rebase: runs-on: ubuntu-20.04 + env: + GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 - - name: Checkout repo + id: checkout uses: actions/checkout@v2 with: fetch-depth: 0 token: ${{ secrets.MERGEBOT_TOKEN }} + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: 3.8 + architecture: x64 + cache: 'pip' + - run: pip install pyyaml==6.0 + - name: Setup committer id run: | git config --global user.email "pytorchmergebot@users.noreply.github.com" @@ -29,7 +34,6 @@ jobs: env: GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} PR_NUM: ${{ github.event.client_payload.pr_num }} - GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} BRANCH: ${{ github.event.client_payload.branch }} run: | set -ex @@ -38,3 +42,12 @@ jobs: else python3 .github/scripts/tryrebase.py "${PR_NUM}" fi + - name: Comment on Canceled + if: ${{ cancelled() && steps.checkout.outcome == 'success' }} + continue-on-error: true + env: + GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} + PR_NUM: ${{ github.event.client_payload.pr_num }} + run: | + set -ex + python3 .github/scripts/comment_on_pr.py "${PR_NUM}" "rebase" diff --git a/.github/workflows/update-viablestrict.yml b/.github/workflows/update-viablestrict.yml index 872d8f5c14285..c7bbb17b4fe96 100644 --- 
a/.github/workflows/update-viablestrict.yml +++ b/.github/workflows/update-viablestrict.yml @@ -13,18 +13,22 @@ jobs: do_update_viablestrict: runs-on: ubuntu-20.04 steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 - - name: Checkout repo uses: actions/checkout@v2 with: fetch-depth: 0 token: ${{ secrets.MERGEBOT_TOKEN }} + - name: Setup Python + uses: actions/setup-python@v2 + with: + python-version: 3.8 + architecture: x64 + cache: pip + cache-dependency-path: | + **/.circleci/docker/requirements-ci.txt + **/.github/requirements-gha-cache.txt + - name: Install Python Packages run: | pip3 install rockset==0.8.10 diff --git a/.github/workflows/update_pytorch_labels.yml b/.github/workflows/update_pytorch_labels.yml index f19347070ecef..31bbab78e2f9a 100644 --- a/.github/workflows/update_pytorch_labels.yml +++ b/.github/workflows/update_pytorch_labels.yml @@ -10,7 +10,7 @@ concurrency: jobs: update-labels-in-S3: - runs-on: ubuntu-18.04 + runs-on: ubuntu-22.04 if: ${{ github.repository == 'pytorch/pytorch' }} steps: - name: Checkout PyTorch diff --git a/.github/workflows/update_s3_htmls.yml b/.github/workflows/update_s3_htmls.yml index 5f3ff056c5a4a..d68b58911bed2 100644 --- a/.github/workflows/update_s3_htmls.yml +++ b/.github/workflows/update_s3_htmls.yml @@ -8,7 +8,7 @@ on: jobs: update-html: - runs-on: ubuntu-18.04 + runs-on: ubuntu-22.04 if: ${{ github.repository == 'pytorch/pytorch' }} strategy: matrix: diff --git a/.gitmodules b/.gitmodules index 538967d317641..32c0c205948a3 100644 --- a/.gitmodules +++ b/.gitmodules @@ -148,3 +148,6 @@ [submodule "third_party/nlohmann"] path = third_party/nlohmann url = https://github.com/nlohmann/json.git +[submodule "third_party/VulkanMemoryAllocator"] + path = third_party/VulkanMemoryAllocator + url = https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator.git diff --git a/.jenkins/caffe2/test.sh b/.jenkins/caffe2/test.sh index 6016911941b5d..0204907ee865d 100755 --- a/.jenkins/caffe2/test.sh +++ b/.jenkins/caffe2/test.sh @@ -173,7 +173,7 @@ fi ############## if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then pip install -q --user --no-use-pep517 "git+https://github.com/pytorch/vision.git@$(cat .github/ci_commit_pins/vision.txt)" - pip install -q --user ninja flatbuffers==2.0 numpy==1.21.5 onnxruntime==1.11.0 + pip install -q --user ninja flatbuffers==2.0 numpy==1.21.5 onnxruntime==1.12.1 beartype==0.10.4 # numba requires numpy <= 1.20, onnxruntime requires numpy >= 1.21. # We don't actually need it for our tests, but it's imported if it's present, so uninstall. pip uninstall -q --yes numba diff --git a/.jenkins/pytorch/build.sh b/.jenkins/pytorch/build.sh index d442a4ebd41c2..a215459fcc7e1 100755 --- a/.jenkins/pytorch/build.sh +++ b/.jenkins/pytorch/build.sh @@ -45,6 +45,12 @@ fi if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then # enable split torch_cuda build option in CMake export BUILD_SPLIT_CUDA=ON + if [[ "$BUILD_ENVIRONMENT" != *cuda11.3* && "$BUILD_ENVIRONMENT" != *clang* ]]; then + # TODO: there is a linking issue when building with UCC using clang, + # disable it for now and to be fix later. 
+ export USE_UCC=1 + export USE_SYSTEM_UCC=1 + fi fi if [[ ${BUILD_ENVIRONMENT} == *"caffe2"* || ${BUILD_ENVIRONMENT} == *"onnx"* ]]; then @@ -169,6 +175,10 @@ if [[ "${BUILD_ENVIRONMENT}" == *no-ops* ]]; then export USE_PER_OPERATOR_HEADERS=0 fi +if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then + export USE_PRECOMPILED_HEADERS=1 +fi + if [[ "${BUILD_ENVIRONMENT}" == *linux-focal-py3.7-gcc7-build* ]]; then export USE_GLOO_WITH_OPENSSL=ON fi diff --git a/.jenkins/pytorch/common_utils.sh b/.jenkins/pytorch/common_utils.sh index 0584ddab9e2a0..7b592d57c280b 100644 --- a/.jenkins/pytorch/common_utils.sh +++ b/.jenkins/pytorch/common_utils.sh @@ -117,6 +117,8 @@ function clone_pytorch_xla() { pushd xla # pin the xla hash so that we don't get broken by changes to xla git checkout "$(cat ../.github/ci_commit_pins/xla.txt)" + git submodule sync + git submodule update --init --recursive popd fi } diff --git a/.jenkins/pytorch/macos-build.sh b/.jenkins/pytorch/macos-build.sh index db33e2dedf95b..d40ec521520ba 100755 --- a/.jenkins/pytorch/macos-build.sh +++ b/.jenkins/pytorch/macos-build.sh @@ -35,11 +35,11 @@ fi cross_compile_arm64() { # Cross compilation for arm64 - USE_DISTRIBUTED=1 CMAKE_OSX_ARCHITECTURES=arm64 MACOSX_DEPLOYMENT_TARGET=11.0 USE_MKLDNN=OFF USE_QNNPACK=OFF BUILD_TEST=OFF python setup.py bdist_wheel + USE_DISTRIBUTED=1 CMAKE_OSX_ARCHITECTURES=arm64 MACOSX_DEPLOYMENT_TARGET=11.0 USE_MKLDNN=OFF USE_QNNPACK=OFF WERROR=1 BUILD_TEST=OFF python setup.py bdist_wheel } compile_x86_64() { - USE_DISTRIBUTED=1 python setup.py bdist_wheel + USE_DISTRIBUTED=1 WERROR=1 python setup.py bdist_wheel } build_lite_interpreter() { diff --git a/.jenkins/pytorch/macos-common.sh b/.jenkins/pytorch/macos-common.sh index 3dc7d0f17e167..4df378d505ecb 100755 --- a/.jenkins/pytorch/macos-common.sh +++ b/.jenkins/pytorch/macos-common.sh @@ -7,19 +7,34 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh" sysctl -a | grep machdep.cpu -# NOTE: mkl 2021.3.0+ cmake requires sub-command PREPEND, may break the build -retry conda install -y \ - mkl=2021.2.0 \ - mkl-include=2021.2.0 \ - numpy=1.18.5 \ - pyyaml=5.3 \ - setuptools=46.0.0 \ - cmake=3.19 \ - cffi \ - ninja \ - typing_extensions \ - dataclasses \ - pip +if [[ ${BUILD_ENVIRONMENT} = *arm64* ]]; then + # We use different versions here as the arm build/tests runs on python 3.9 + # while the x86 one runs on python 3.8 + retry conda install -y \ + numpy=1.22.3 \ + pyyaml=6.0 \ + setuptools=61.2.0 \ + cmake=3.22.1 \ + cffi \ + ninja \ + typing_extensions \ + dataclasses \ + pip +else + # NOTE: mkl 2021.3.0+ cmake requires sub-command PREPEND, may break the build + retry conda install -y \ + mkl=2021.2.0 \ + mkl-include=2021.2.0 \ + numpy=1.18.5 \ + pyyaml=5.3 \ + setuptools=46.0.0 \ + cmake=3.19 \ + cffi \ + ninja \ + typing_extensions \ + dataclasses \ + pip +fi # The torch.hub tests make requests to GitHub. 
# diff --git a/.jenkins/pytorch/macos-test.sh b/.jenkins/pytorch/macos-test.sh index 1b15fab1ed205..a30e16ba942ee 100755 --- a/.jenkins/pytorch/macos-test.sh +++ b/.jenkins/pytorch/macos-test.sh @@ -5,14 +5,20 @@ source "$(dirname "${BASH_SOURCE[0]}")/macos-common.sh" conda install -y six -pip install -q hypothesis "expecttest==0.1.3" "librosa>=0.6.2" "numba<=0.49.1" psutil "scipy==1.6.3" +if [[ ${BUILD_ENVIRONMENT} = *arm64* ]]; then + pip install hypothesis "expecttest==0.1.3" "librosa>=0.6.2" "numba==0.56.0" psutil "scipy==1.9.0" +else + pip install hypothesis "expecttest==0.1.3" "librosa>=0.6.2" "numba<=0.49.1" psutil "scipy==1.6.3" +fi # TODO move this to docker # Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014 pip install "unittest-xml-reporting<=3.2.0,>=2.0.0" \ pytest \ pytest-xdist \ - pytest-rerunfailures + pytest-rerunfailures \ + "xdoctest==1.0.2" \ + "pygments==2.12.0" if [ -z "${CI}" ]; then rm -rf "${WORKSPACE_DIR}"/miniconda3/lib/python3.6/site-packages/torch* @@ -32,14 +38,15 @@ if [ -z "${CI}" ]; then 7z x "${IMAGE_COMMIT_TAG}".7z -o"${WORKSPACE_DIR}/miniconda3/lib/python3.6/site-packages" fi -# Test that OpenMP is enabled -pushd test -if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then - echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False" - exit 1 +# Test that OpenMP is enabled for non-arm64 build +if [[ ${BUILD_ENVIRONMENT} != *arm64* ]]; then + pushd test + if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then + echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False" + exit 1 + fi + popd fi -popd - setup_test_python() { # The CircleCI worker hostname doesn't resolve to an address. 
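The arm64 branch above pins newer numba/scipy for the Python 3.9 M1 runners and skips the OpenMP assertion for that build. A quick, purely illustrative way to check both conditions by hand (not part of the CI scripts):

```
# Which architecture is this interpreter running on? (arm64 vs x86_64)
python -c "import platform; print(platform.machine())"
# Does the installed torch report OpenMP support (the assertion skipped on arm64)?
python -c "import torch; print(int(torch.backends.openmp.is_available()))"
```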
@@ -165,7 +172,7 @@ test_jit_hooks() { test_dynamo() { pushd ../torchdynamo - pytest tests + pytest test popd } diff --git a/.jenkins/pytorch/multigpu-test.sh b/.jenkins/pytorch/multigpu-test.sh index d75d701e8e18b..bbd1c370a638e 100755 --- a/.jenkins/pytorch/multigpu-test.sh +++ b/.jenkins/pytorch/multigpu-test.sh @@ -7,7 +7,7 @@ # shellcheck source=./common.sh source "$(dirname "${BASH_SOURCE[0]}")/common.sh" -echo "Testing pytorch (distributed only)" +echo "Testing pytorch" if [ -n "${CI}" ]; then # TODO move this to docker # Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014 @@ -48,4 +48,6 @@ time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/ops/ time python test/run_test.py --verbose -i distributed/_shard/sharded_optim/test_sharded_optim time python test/run_test.py --verbose -i distributed/_shard/test_partial_tensor time python test/run_test.py --verbose -i distributed/_shard/test_replicated_tensor +# Other tests +time python test/run_test.py --verbose -i test_cuda_primary_ctx assert_git_not_dirty diff --git a/.jenkins/pytorch/test.sh b/.jenkins/pytorch/test.sh index b476d25250791..51bf9c8f98fc4 100755 --- a/.jenkins/pytorch/test.sh +++ b/.jenkins/pytorch/test.sh @@ -163,7 +163,9 @@ test_python_shard() { echo "NUM_TEST_SHARDS must be defined to run a Python test shard" exit 1 fi + time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard "$1" "$NUM_TEST_SHARDS" --verbose + assert_git_not_dirty } @@ -178,6 +180,8 @@ test_dynamo_shard() { echo "NUM_TEST_SHARDS must be defined to run a Python test shard" exit 1 fi + # Temporarily disable test_fx for dynamo pending the investigation on TTS + # regression in https://github.com/pytorch/torchdynamo/issues/784 time python test/run_test.py \ --exclude-jit-executor \ --exclude-distributed-tests \ @@ -194,6 +198,9 @@ test_dynamo_shard() { test_profiler_tree \ test_overrides \ test_python_dispatch \ + test_fx \ + test_package \ + test_vmap \ --shard "$1" "$NUM_TEST_SHARDS" \ --verbose assert_git_not_dirty @@ -592,7 +599,7 @@ test_vec256() { test_dynamo() { pushd ../torchdynamo - pytest tests + pytest test popd } diff --git a/.jenkins/pytorch/win-test-helpers/build_pytorch.bat b/.jenkins/pytorch/win-test-helpers/build_pytorch.bat index b954430734b02..7edeca96ed8d0 100644 --- a/.jenkins/pytorch/win-test-helpers/build_pytorch.bat +++ b/.jenkins/pytorch/win-test-helpers/build_pytorch.bat @@ -29,7 +29,9 @@ call %INSTALLER_DIR%\install_sccache.bat if errorlevel 1 exit /b if not errorlevel 0 exit /b -call %INSTALLER_DIR%\install_miniconda3.bat +:: Miniconda has been installed as part of the Windows AMI with all the dependencies. 
+:: We just need to activate it here +call %INSTALLER_DIR%\activate_miniconda3.bat if errorlevel 1 exit /b if not errorlevel 0 exit /b diff --git a/.jenkins/pytorch/win-test-helpers/installation-helpers/install_miniconda3.bat b/.jenkins/pytorch/win-test-helpers/installation-helpers/activate_miniconda3.bat similarity index 65% rename from .jenkins/pytorch/win-test-helpers/installation-helpers/install_miniconda3.bat rename to .jenkins/pytorch/win-test-helpers/installation-helpers/activate_miniconda3.bat index 54b954a0503f1..e6660a17b3890 100644 --- a/.jenkins/pytorch/win-test-helpers/installation-helpers/install_miniconda3.bat +++ b/.jenkins/pytorch/win-test-helpers/installation-helpers/activate_miniconda3.bat @@ -4,24 +4,32 @@ if "%BUILD_ENVIRONMENT%"=="" ( set CONDA_PARENT_DIR=C:\Jenkins ) -if "%REBUILD%"=="" set INSTALL_FRESH_CONDA=1 -if NOT "%BUILD_ENVIRONMENT%"=="" set INSTALL_FRESH_CONDA=1 + +:: Be conservative here when rolling out the new AMI with conda. This will try +:: to install conda as before if it couldn't find the conda installation. This +:: can be removed eventually after we gain enough confidence in the AMI +if not exist %CONDA_PARENT_DIR%\Miniconda3 ( + set INSTALL_FRESH_CONDA=1 +) if "%INSTALL_FRESH_CONDA%"=="1" ( - IF EXIST %CONDA_PARENT_DIR%\Miniconda3 ( rd /s /q %CONDA_PARENT_DIR%\Miniconda3 ) curl --retry 3 -k https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe --output %TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe if errorlevel 1 exit /b if not errorlevel 0 exit /b + %TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /AddToPath=0 /D=%CONDA_PARENT_DIR%\Miniconda3 if errorlevel 1 exit /b if not errorlevel 0 exit /b ) +:: Activate conda so that we can use its commands, i.e. conda, python, pip call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3 + if "%INSTALL_FRESH_CONDA%"=="1" ( - call conda install -y -q python=%PYTHON_VERSION% numpy"<1.23" cffi pyyaml boto3 libuv + call conda install -y -q numpy"<1.23" cffi pyyaml boto3 libuv if errorlevel 1 exit /b if not errorlevel 0 exit /b + call conda install -y -q -c conda-forge cmake=3.22.3 if errorlevel 1 exit /b if not errorlevel 0 exit /b diff --git a/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat b/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat index 90725b7666a33..79e8aedfab75c 100644 --- a/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat +++ b/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat @@ -7,10 +7,10 @@ set PATH=C:\Program Files\CMake\bin;C:\Program Files\7-Zip;C:\ProgramData\chocol :: Install Miniconda3 set INSTALLER_DIR=%SCRIPT_HELPERS_DIR%\installation-helpers -call :retry %INSTALLER_DIR%\install_miniconda3.bat -:retry -call %* || (powershell -nop -c "& {sleep 1}" && call %*) || (powershell -nop -c "& {sleep 2}" && call %*) +:: Miniconda has been installed as part of the Windows AMI with all the dependencies. 
+:: We just need to activate it here +call %INSTALLER_DIR%\activate_miniconda3.bat if errorlevel 1 exit /b if not errorlevel 0 exit /b @@ -36,7 +36,7 @@ popd ======= :: Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014 -pip install "ninja==1.10.0.post1" future "hypothesis==5.35.1" "expecttest==0.1.3" "librosa>=0.6.2" "scipy==1.6.3" psutil pillow "unittest-xml-reporting<=3.2.0,>=2.0.0" pytest pytest-xdist pytest-rerunfailures +pip install "ninja==1.10.0.post1" future "hypothesis==5.35.1" "expecttest==0.1.3" "librosa>=0.6.2" "scipy==1.6.3" psutil pillow "unittest-xml-reporting<=3.2.0,>=2.0.0" pytest pytest-xdist pytest-rerunfailures "xdoctest==1.0.2" "pygments==2.12.0" if errorlevel 1 exit /b if not errorlevel 0 exit /b diff --git a/.lintrunner.toml b/.lintrunner.toml index 02b02d1aaf06e..b2fa676f8e13c 100644 --- a/.lintrunner.toml +++ b/.lintrunner.toml @@ -102,6 +102,7 @@ exclude_patterns = [ 'torch/distributed/elastic/agent/server/api.py', 'torch/testing/_internal/**', 'torch/distributed/fsdp/fully_sharded_data_parallel.py', + 'torch/distributed/distributed_c10d.py', # TODO(suo): these exclusions were added just to get lint clean on master. # Follow up to do more target suppressions and remove them. 'torch/distributed/fsdp/flatten_params_wrapper.py', @@ -718,7 +719,10 @@ include_patterns = [ 'torch/_refs/**/*.py', 'torch/_subclasses/**/*.py', 'torch/_*.py', + 'torch/testing/_internal/opinfo/**/*.py', 'torchgen/**/*.py', + 'functorch/functorch/_src/aot_autograd.py', + 'functorch/functorch/_src/compilers.py', ] command = [ 'python3', diff --git a/BUILD.bazel b/BUILD.bazel index 823a59bb63b75..4c0791bffbb4a 100644 --- a/BUILD.bazel +++ b/BUILD.bazel @@ -1877,6 +1877,7 @@ test_suite( "aten/src/ATen/templates/LazyIr.h", "aten/src/ATen/templates/LazyNonNativeIr.h", "aten/src/ATen/templates/RegisterDispatchKey.cpp", + "aten/src/ATen/templates/RegisterDispatchDefinitions.ini", "aten/src/ATen/native/native_functions.yaml", "aten/src/ATen/native/tags.yaml", "aten/src/ATen/native/ts_native_functions.yaml", diff --git a/CMakeLists.txt b/CMakeLists.txt index 38a430ee7287c..9b6fedca3f719 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -43,7 +43,7 @@ set(CMAKE_C_STANDARD 11 CACHE STRING "The C standard whose features are reques if(DEFINED GLIBCXX_USE_CXX11_ABI) if(${GLIBCXX_USE_CXX11_ABI} EQUAL 1) set(CXX_STANDARD_REQUIRED ON) - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_GLIBCXX_USE_CXX11_ABI=1") + string(APPEND CMAKE_CXX_FLAGS " -D_GLIBCXX_USE_CXX11_ABI=1") else() # Please note this is required in order to ensure compatibility between gcc 9 and gcc 7 # This could be removed when all Linux PyTorch binary builds are compiled by the same toolchain again @@ -186,7 +186,7 @@ cmake_dependent_option( INSTALL_TEST "Install test binaries if BUILD_TEST is on" ON "BUILD_TEST" OFF) option(USE_CPP_CODE_COVERAGE "Compile C/C++ with code coverage flags" OFF) -option(COLORIZE_OUTPUT "Colorize output during compilation" ON) +option(USE_COLORIZE_OUTPUT "Colorize output during compilation" ON) option(USE_ASAN "Use Address Sanitizer" OFF) option(USE_TSAN "Use Thread Sanitizer" OFF) option(USE_CUDA "Use CUDA" ON) @@ -209,8 +209,8 @@ cmake_dependent_option( USE_STATIC_CUDNN "Use cuDNN static libraries" OFF "USE_CUDNN" OFF) cmake_dependent_option( - BUILD_NVFUSER_BENCHMARK "Build C++ binaries for nvfuser benchmarks" ON - "USE_CUDA;BUILD_TEST" OFF) + BUILD_NVFUSER_BENCHMARK "Build C++ binaries for nvfuser benchmarks" OFF + "USE_CUDA" OFF) 
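Two build-option changes above are easy to miss: `COLORIZE_OUTPUT` is renamed to `USE_COLORIZE_OUTPUT`, and `BUILD_NVFUSER_BENCHMARK` now defaults to OFF. A hedged sketch of opting back in locally, assuming the usual `USE_*`/`BUILD_*` environment passthrough to CMake; the direct `-D` form is the unambiguous equivalent:

```
# Environment-variable style, matching the `VAR=1 python setup.py ...` builds above
# (assumes the USE_*/BUILD_* environment passthrough to CMake):
USE_COLORIZE_OUTPUT=1 BUILD_NVFUSER_BENCHMARK=1 python setup.py develop
# Direct CMake configure equivalent from a build directory (illustrative):
cmake -DUSE_COLORIZE_OUTPUT=ON -DBUILD_NVFUSER_BENCHMARK=ON -DUSE_CUDA=ON ..
```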
cmake_dependent_option( USE_EXPERIMENTAL_CUDNN_V8_API "Use experimental cuDNN v8 API" ON "USE_CUDNN" OFF) @@ -799,22 +799,22 @@ if(NOT MSVC) # Details at http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1459 string(APPEND CMAKE_CXX_FLAGS " -Wall") string(APPEND CMAKE_CXX_FLAGS " -Wextra") - string(APPEND CMAKE_CXX_FLAGS " -Werror=return-type") + append_cxx_flag_if_supported("-Werror=return-type" CMAKE_CXX_FLAGS) if(NOT USE_CUDNN) # Temporary fix to ignore non virtual dtor error if cudnn is used. A # separate PR to cudnn_frontend is needed to address this later on - string(APPEND CMAKE_CXX_FLAGS " -Werror=non-virtual-dtor") + append_cxx_flag_if_supported("-Werror=non-virtual-dtor" CMAKE_CXX_FLAGS) endif() - string(APPEND CMAKE_CXX_FLAGS " -Wno-missing-field-initializers") - string(APPEND CMAKE_CXX_FLAGS " -Wno-type-limits") - string(APPEND CMAKE_CXX_FLAGS " -Wno-array-bounds") - string(APPEND CMAKE_CXX_FLAGS " -Wno-unknown-pragmas") - string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-parameter") - string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-function") - string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-result") - string(APPEND CMAKE_CXX_FLAGS " -Wno-strict-overflow") - string(APPEND CMAKE_CXX_FLAGS " -Wno-strict-aliasing") - string(APPEND CMAKE_CXX_FLAGS " -Wno-error=deprecated-declarations") + append_cxx_flag_if_supported("-Wno-missing-field-initializers" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-type-limits" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-array-bounds" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-unknown-pragmas" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-unused-parameter" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-unused-function" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-unused-result" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-strict-overflow" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-strict-aliasing" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-error=deprecated-declarations" CMAKE_CXX_FLAGS) if("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang") string(APPEND CMAKE_CXX_FLAGS " -Wno-range-loop-analysis") string(APPEND CMAKE_CXX_FLAGS " -Wno-pass-failed") @@ -855,32 +855,31 @@ if(NOT MSVC) endif() endif() - string(APPEND CMAKE_CXX_FLAGS " -Wno-error=pedantic") - string(APPEND CMAKE_CXX_FLAGS " -Wno-error=redundant-decls") - string(APPEND CMAKE_CXX_FLAGS " -Wno-error=old-style-cast") + append_cxx_flag_if_supported("-Wno-error=pedantic" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-error=redundant-decls" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-error=old-style-cast" CMAKE_CXX_FLAGS) # These flags are not available in GCC-4.8.5. Set only when using clang. 
# Compared against https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Option-Summary.html if("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang") - string(APPEND CMAKE_CXX_FLAGS " -Wconstant-conversion") - string(APPEND CMAKE_CXX_FLAGS " -Wno-invalid-partial-specialization") - string(APPEND CMAKE_CXX_FLAGS " -Wno-typedef-redefinition") - string(APPEND CMAKE_CXX_FLAGS " -Wno-unknown-warning-option") - string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-private-field") - string(APPEND CMAKE_CXX_FLAGS " -Wno-inconsistent-missing-override") - string(APPEND CMAKE_CXX_FLAGS " -Wno-aligned-allocation-unavailable") - string(APPEND CMAKE_CXX_FLAGS " -Wno-c++14-extensions") - string(APPEND CMAKE_CXX_FLAGS " -Wno-constexpr-not-const") - string(APPEND CMAKE_CXX_FLAGS " -Wno-missing-braces") - string(APPEND CMAKE_CXX_FLAGS " -Qunused-arguments") - if(${COLORIZE_OUTPUT}) - string(APPEND CMAKE_CXX_FLAGS " -fcolor-diagnostics") + append_cxx_flag_if_supported("-Wconstant-conversion" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-invalid-partial-specialization" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-typedef-redefinition" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-unused-private-field" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-inconsistent-missing-override" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-aligned-allocation-unavailable" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-c++14-extensions" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-constexpr-not-const" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-missing-braces" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Qunused-arguments" CMAKE_CXX_FLAGS) + if(${USE_COLORIZE_OUTPUT}) endif() endif() - if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 4.9) - if(${COLORIZE_OUTPUT}) - string(APPEND CMAKE_CXX_FLAGS " -fdiagnostics-color=always") - endif() + + if(${USE_COLORIZE_OUTPUT}) + append_cxx_flag_if_supported("-fcolor-diagnostics" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-fdiagnostics-color=always" CMAKE_CXX_FLAGS) endif() + if((APPLE AND (NOT ("${CLANG_VERSION_STRING}" VERSION_LESS "9.0"))) OR(CMAKE_COMPILER_IS_GNUCXX AND(CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 7.0 AND NOT APPLE))) @@ -895,21 +894,15 @@ if(NOT MSVC) endif() endif(WERROR) if(NOT APPLE) - string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-but-set-variable") - string(APPEND CMAKE_CXX_FLAGS " -Wno-maybe-uninitialized") + append_cxx_flag_if_supported("-Wno-unused-but-set-variable" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-maybe-uninitialized" CMAKE_CXX_FLAGS) endif() string(APPEND CMAKE_CXX_FLAGS_DEBUG " -fno-omit-frame-pointer -O0") string(APPEND CMAKE_LINKER_FLAGS_DEBUG " -fno-omit-frame-pointer -O0") - string(APPEND CMAKE_CXX_FLAGS " -fno-math-errno") - string(APPEND CMAKE_CXX_FLAGS " -fno-trapping-math") - check_cxx_compiler_flag("-Werror=format" HAS_WERROR_FORMAT) - if(HAS_WERROR_FORMAT) - string(APPEND CMAKE_CXX_FLAGS " -Werror=format") - endif() - check_cxx_compiler_flag("-Werror=cast-function-type" HAS_WERROR_CAST_FUNCTION_TYPE) - if(HAS_WERROR_CAST_FUNCTION_TYPE) - string(APPEND CMAKE_CXX_FLAGS " -Werror=cast-function-type") - endif() + append_cxx_flag_if_supported("-fno-math-errno" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-fno-trapping-math" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Werror=format" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Werror=cast-function-type" CMAKE_CXX_FLAGS) check_cxx_compiler_flag("-Werror=sign-compare" 
HAS_WERROR_SIGN_COMPARE) # This doesn't work globally so we use the test on specific # target_compile_options @@ -970,20 +963,20 @@ if(APPLE) if(USE_MPS) string(APPEND CMAKE_CXX_FLAGS " -DUSE_MPS -fno-objc-arc") string(APPEND CMAKE_SHARED_LINKER_FLAGS " -weak_framework Foundation -weak_framework MetalPerformanceShaders -weak_framework MetalPerformanceShadersGraph -weak_framework Metal") + # To suppress MPSGraph availability warnings + append_cxx_flag_if_supported("-Wno-unguarded-availability-new" CMAKE_CXX_FLAGS) endif() - string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-private-field") - string(APPEND CMAKE_CXX_FLAGS " -Wno-missing-braces") - string(APPEND CMAKE_CXX_FLAGS " -Wno-c++14-extensions") - string(APPEND CMAKE_CXX_FLAGS " -Wno-constexpr-not-const") + append_cxx_flag_if_supported("-Wno-unused-private-field" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-missing-braces" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-c++14-extensions" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wno-constexpr-not-const" CMAKE_CXX_FLAGS) endif() if(EMSCRIPTEN) string(APPEND CMAKE_CXX_FLAGS " -Wno-implicit-function-declaration -DEMSCRIPTEN -s DISABLE_EXCEPTION_CATCHING=0") endif() -if(CMAKE_COMPILER_IS_GNUCXX AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 7.0.0) - string(APPEND CMAKE_CXX_FLAGS " -Wno-stringop-overflow") -endif() +append_cxx_flag_if_supported("-Wno-stringop-overflow" CMAKE_CXX_FLAGS) if(ANDROID AND (NOT ANDROID_DEBUG_SYMBOLS)) if(CMAKE_COMPILER_IS_GNUCXX) diff --git a/CODEOWNERS b/CODEOWNERS index 1bb8efe9de0b9..a467820fb0759 100644 --- a/CODEOWNERS +++ b/CODEOWNERS @@ -1,5 +1,11 @@ +# IMPORTANT: +# This file is ONLY used to subscribe for the notifications for a +# PRs related to a specific files. People in this file are +# notrequire the approval for your changes. + # This is a comment. # Each line is a file pattern followed by one or more owners. +# For module labels => owners mapping, please see https://github.com/pytorch/pytorch/issues/24422. /torch/utils/cpp_extension.py @fmassa @soumith @ezyang @@ -8,7 +14,7 @@ /torch/csrc/autograd/ @albanD @soulitzer /torch/autograd/ @albanD @soulitzer /tools/autograd/ @albanD @soulitzer -/torch/nn/ @albanD @jbschlosser +/torch/nn/ @albanD @jbschlosser @saketh-are /torch/optim/ @albanD /test/test_public_bindings.py @albanD /test/allowlist_for_publicAPI.json @albanD @anjali411 @@ -77,3 +83,9 @@ test/test_type_promotion.py @mruberry @ngimel test/test_mps.py @kulinseth aten/src/ATen/mps/ @kulinseth aten/src/ATen/native/mps/ @kulinseth + +# Profiler +torch/csrc/autograd/profiler* @robieta +torch/autograd/profiler* @robieta +torch/csrc/profiler/ @robieta +torch/profiler/ @robieta diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 7b4a1246d002d..a007cedbdcac1 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -17,6 +17,7 @@ - [C++ Unit Testing](#c-unit-testing) - [Run Specific CI Jobs](#run-specific-ci-jobs) - [Writing documentation](#writing-documentation) + - [Docstring type formatting](#docstring-type-formatting) - [Building documentation](#building-documentation) - [Tips](#tips) - [Building C++ Documentation](#building-c-documentation) @@ -447,9 +448,47 @@ If you're interested in adding new developer docs, please read this [page on the The rest of this section is about user-facing documentation. -PyTorch uses [Google style](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) +PyTorch uses [Google style](https://www.sphinx-doc.org/en/master/usage/extensions/example_google.html) for formatting docstrings. 
Each line inside a docstrings block must be limited to 80 characters so that it fits into Jupyter documentation popups. + +### Docstring type formatting + +In addition to the standard Google Style docstring formatting rules, the following guidelines should be followed for docstring types (docstring types are the type information contained in the round brackets after the variable name): + +* The "`Callable`", "`Any`", "`Iterable`", "`Iterator`", "`Generator`" types should have their first letter capitalized. + +* The "`list`" and "`tuple`" types should be completely lowercase. + +* Types should not be made plural. For example: `tuple of int` should be used instead of `tuple of ints`. + +* The only acceptable delimiter words for types are `or` and `of`. No other non-type words should be used other than `optional`. + +* The word `optional` should only be used after the types, and it is only used if the user does not have to specify a value for the variable. Default values are listed after the variable description. Example: + + ``` + my_var (int, optional): Variable description. Default: 1 + ``` + +* Basic Python types should match their type name so that the [Intersphinx](https://www.sphinx-doc.org/en/master/usage/extensions/intersphinx.html) extension can correctly identify them. For example: + * Use `str` instead of `string`. + * Use `bool` instead of `boolean`. + * Use `dict` instead of `dictionary`. + +* Square brackets should be used for the dictionary type. For example: + + ``` + my_var (dict[str, int]): Variable description. + ``` + +* If a variable has two different possible types, then the word `or` should be used without a comma. Otherwise variables with 3 or more types should use commas to separate the types. Example: + + ``` + x (type1 or type2): Variable description. + y (type1, type2, or type3): Variable description. + ``` + + ### Building documentation To build the documentation: diff --git a/Dockerfile b/Dockerfile index 1bd522a624067..815a9108ce946 100644 --- a/Dockerfile +++ b/Dockerfile @@ -11,8 +11,7 @@ ARG BASE_IMAGE=ubuntu:18.04 ARG PYTHON_VERSION=3.8 FROM ${BASE_IMAGE} as dev-base -RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt \ - apt-get update && apt-get install -y --no-install-recommends \ +RUN apt-get update && apt-get install -y --no-install-recommends \ build-essential \ ca-certificates \ ccache \ @@ -28,9 +27,16 @@ ENV PATH /opt/conda/bin:$PATH FROM dev-base as conda ARG PYTHON_VERSION=3.8 +# Automatically set by buildx +ARG TARGETPLATFORM +# translating Docker's TARGETPLATFORM into miniconda arches +RUN case ${TARGETPLATFORM} in \ + "linux/arm64") MINICONDA_ARCH=aarch64 ;; \ + *) MINICONDA_ARCH=x86_64 ;; \ + esac && \ + curl -fsSL -v -o ~/miniconda.sh -O "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-${MINICONDA_ARCH}.sh" COPY requirements.txt . 
-RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \ - chmod +x ~/miniconda.sh && \ +RUN chmod +x ~/miniconda.sh && \ ~/miniconda.sh -b -p /opt/conda && \ rm ~/miniconda.sh && \ /opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build pyyaml numpy ipython && \ @@ -57,15 +63,21 @@ ARG CUDA_VERSION=11.3 ARG CUDA_CHANNEL=nvidia ARG INSTALL_CHANNEL=pytorch-nightly ENV CONDA_OVERRIDE_CUDA=${CUDA_VERSION} -RUN /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -c "${CUDA_CHANNEL}" -y python=${PYTHON_VERSION} pytorch torchvision torchtext "cudatoolkit=${CUDA_VERSION}" && \ +# Automatically set by buildx +RUN /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -y python=${PYTHON_VERSION} +ARG TARGETPLATFORM +# On arm64 we can only install wheel packages +RUN case ${TARGETPLATFORM} in \ + "linux/arm64") pip install --extra-index-url https://download.pytorch.org/whl/cpu/ torch torchvision torchtext ;; \ + *) /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -c "${CUDA_CHANNEL}" -y "python=${PYTHON_VERSION}" pytorch torchvision torchtext "cudatoolkit=${CUDA_VERSION}" ;; \ + esac && \ /opt/conda/bin/conda clean -ya RUN /opt/conda/bin/pip install torchelastic FROM ${BASE_IMAGE} as official ARG PYTORCH_VERSION LABEL com.nvidia.volumes.needed="nvidia_driver" -RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ - apt-get update && apt-get install -y --no-install-recommends \ +RUN apt-get update && apt-get install -y --no-install-recommends \ ca-certificates \ libjpeg-dev \ libpng-dev && \ diff --git a/WORKSPACE b/WORKSPACE index d26dfca5a3336..61abbdac2b239 100644 --- a/WORKSPACE +++ b/WORKSPACE @@ -88,6 +88,7 @@ new_local_repository( name = "fbgemm", build_file = "//third_party:fbgemm/BUILD.bazel", path = "third_party/fbgemm", + repo_mapping = {"@cpuinfo" : "@org_pytorch_cpuinfo"} ) new_local_repository( @@ -103,8 +104,8 @@ new_local_repository( ) new_local_repository( - name = "cpuinfo", - build_file = "//third_party:cpuinfo.BUILD", + name = "org_pytorch_cpuinfo", + build_file = "//third_party:cpuinfo/BUILD.bazel", path = "third_party/cpuinfo", ) diff --git a/aten/CMakeLists.txt b/aten/CMakeLists.txt index 9c3757f346cda..777b13b4dcf09 100644 --- a/aten/CMakeLists.txt +++ b/aten/CMakeLists.txt @@ -55,7 +55,7 @@ set(TH_CPU_INCLUDE list(APPEND ATen_CPU_INCLUDE ${TH_CPU_INCLUDE}) if(USE_VULKAN) - list(APPEND ATen_CPU_INCLUDE ${CMAKE_BINARY_DIR}/vulkan) + list(APPEND ATen_CPU_INCLUDE ${CMAKE_BINARY_DIR}/vulkan ${CMAKE_CURRENT_SOURCE_DIR}/../third_party/VulkanMemoryAllocator) endif() # Find the HIP package, set the HIP paths, load the HIP CMake. 
diff --git a/aten/src/ATen/BatchingRegistrations.cpp b/aten/src/ATen/BatchingRegistrations.cpp index a269f82fa8176..28de70636700c 100644 --- a/aten/src/ATen/BatchingRegistrations.cpp +++ b/aten/src/ATen/BatchingRegistrations.cpp @@ -189,10 +189,6 @@ Tensor expand_symint_batching_rule(const Tensor& self, SymIntArrayRef psize, boo return self.expand(asIntArrayRefSlow(psize), implicit); } -Tensor sum_symint_batching_rule(const Tensor& input_t, c10::SymIntArrayRef dim, bool keepdim, optional opt_dtype) { - return input_t.sum(c10::asIntArrayRefSlow(dim), keepdim, opt_dtype); -} - std::vector chunk_batching_rule(const Tensor& self, int64_t chunks, int64_t dim) { auto self_physical = MultiBatchVmapTransform::logicalToPhysical(self); auto dim_physical = self_physical.getPhysicalDim(dim); @@ -1100,7 +1096,6 @@ TORCH_LIBRARY_IMPL(aten, Batched, m) { m.impl("_new_zeros_with_same_feature_meta", _new_zeros_with_same_feature_meta_batching_rule); m.impl("sum.dim_IntList", sum_batching_rule); - m.impl("sum.SymInt", sum_symint_batching_rule); m.impl("is_complex", native::is_complex); // inplace operations diff --git a/aten/src/ATen/Context.h b/aten/src/ATen/Context.h index 8f3928376473d..b21f32b9021a2 100644 --- a/aten/src/ATen/Context.h +++ b/aten/src/ATen/Context.h @@ -253,7 +253,11 @@ class TORCH_API Context { bool deterministic_cudnn = false; bool _deterministic_algorithms = false; bool _deterministic_algorithms_warn_only = false; +#ifdef USE_ROCM + bool benchmark_cudnn = true; +#else bool benchmark_cudnn = false; +#endif Float32MatmulPrecision float32_matmul_precision = at::Float32MatmulPrecision::HIGHEST; int benchmark_limit_cudnn = 10; diff --git a/aten/src/ATen/DLConvertor.cpp b/aten/src/ATen/DLConvertor.cpp index fb3f3596e1fe0..54df9d631d14f 100644 --- a/aten/src/ATen/DLConvertor.cpp +++ b/aten/src/ATen/DLConvertor.cpp @@ -215,11 +215,22 @@ void deleter(DLManagedTensor* arg) { // This function returns a shared_ptr to memory managed DLpack tensor // constructed out of ATen tensor DLManagedTensor* toDLPack(const Tensor& src) { + // create a new tensor with possibly normalized strides + // gh-83069 + auto shape = src.sizes(); + auto strides = src.strides().vec(); + for (int i=0; ihandle = src; + atDLMTensor->handle = view; atDLMTensor->tensor.manager_ctx = atDLMTensor; atDLMTensor->tensor.deleter = &deleter; - atDLMTensor->tensor.dl_tensor.data = src.data_ptr(); + atDLMTensor->tensor.dl_tensor.data = view.data_ptr(); int64_t device_id = 0; if (src.is_cuda()) { device_id = src.get_device(); @@ -229,10 +240,10 @@ DLManagedTensor* toDLPack(const Tensor& src) { atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src); atDLMTensor->tensor.dl_tensor.shape = // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast) - const_cast(src.sizes().data()); + const_cast(view.sizes().data()); atDLMTensor->tensor.dl_tensor.strides = // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast) - const_cast(src.strides().data()); + const_cast(view.strides().data()); atDLMTensor->tensor.dl_tensor.byte_offset = 0; return &(atDLMTensor->tensor); } diff --git a/aten/src/ATen/Dispatch.h b/aten/src/ATen/Dispatch.h index 08d41126a1619..3ae552f96d17c 100644 --- a/aten/src/ATen/Dispatch.h +++ b/aten/src/ATen/Dispatch.h @@ -8,6 +8,10 @@ #include #include +#ifdef __CUDACC__ +#include // For CUDA_VERSION +#endif + #ifdef TEMPLATE_SELECTIVE_BUILD #include #else @@ -72,10 +76,20 @@ TORCH_API void record_kernel_function_dtype(std::string name); }) #endif +// Workaround for C10_UNUSED because CUDA 10.2 and below fails to handle unused +// 
attribute in the type aliasing context. Keep name long and verbose to avoid +// macro collisions. +#if defined(__CUDACC__) && CUDA_VERSION < 11000 +#define C10_UNUSED_DISPATCH_CUDA_WORKAROUND +#else +#define C10_UNUSED_DISPATCH_CUDA_WORKAROUND C10_UNUSED +#endif + #define AT_PRIVATE_CASE_TYPE_USING_HINT(enum_type, HINT, ...) \ case enum_type: { \ AT_PRIVATE_CHECK_SELECTIVE_BUILD(enum_type); \ - using HINT = c10::impl::ScalarTypeToCPPTypeT; \ + using HINT C10_UNUSED_DISPATCH_CUDA_WORKAROUND = \ + c10::impl::ScalarTypeToCPPTypeT; \ return __VA_ARGS__(); \ } diff --git a/aten/src/ATen/EmptyTensor.cpp b/aten/src/ATen/EmptyTensor.cpp index caf2a4e653c86..ff91aa0bd14d6 100644 --- a/aten/src/ATen/EmptyTensor.cpp +++ b/aten/src/ATen/EmptyTensor.cpp @@ -62,6 +62,14 @@ size_t computeStorageNbytes( size_t itemsize_bytes, size_t storage_offset ) { + TORCH_CHECK( + sizes.size() == strides.size(), + "dimensionality of sizes (", + sizes.size(), + ") must match dimensionality of strides (", + strides.size(), + ")"); + // Ignore overflow checks on mobile #ifndef C10_MOBILE // size of the underlying storage is 1 bigger than the offset diff --git a/aten/src/ATen/ExpandUtils.h b/aten/src/ATen/ExpandUtils.h index 7a81076a7dd01..a54853b259e73 100644 --- a/aten/src/ATen/ExpandUtils.h +++ b/aten/src/ATen/ExpandUtils.h @@ -446,7 +446,7 @@ static inline Tensor sum_to( } auto sizes = tensor.sym_sizes(); - c10::SmallVector reduce_dims; + c10::SmallVector reduce_dims; const int64_t leading_dims = sizes.size() - shape.size(); for (const auto i : c10::irange(leading_dims)) { reduce_dims.push_back(i); @@ -458,7 +458,7 @@ static inline Tensor sum_to( } if (!reduce_dims.empty()) { - tensor = tensor.sum_symint(reduce_dims, /*keepdim=*/true); + tensor = tensor.sum(reduce_dims, /*keepdim=*/true); } if (always_return_non_view) { diff --git a/aten/src/ATen/FunctionalStorageImpl.cpp b/aten/src/ATen/FunctionalStorageImpl.cpp index 2fad6bfad6064..7f136759ef6af 100644 --- a/aten/src/ATen/FunctionalStorageImpl.cpp +++ b/aten/src/ATen/FunctionalStorageImpl.cpp @@ -2,6 +2,7 @@ #include #include +#include #include #include @@ -94,9 +95,8 @@ FunctionalStorageImpl::FunctionalStorageImpl(const Tensor& value) c10::StorageImpl::use_byte_size_t(), value.numel() * value.dtype().itemsize(), DataPtr{nullptr, value.device()}, - // Using a null allocator, since FunctionalTensorImpl's aren't resizeable. 
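A minimal standalone sketch of the stride normalization that the DLConvertor.cpp hunk above applies before exporting to DLPack, assuming the gh-83069 behavior of forcing the stride to 1 on any dimension of extent 0 or 1; `normalize_strides` is a hypothetical helper for illustration, not the code in the patch:

```
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: force the stride to 1 on any dimension of extent 0 or 1
// before handing the tensor to DLPack, so consumers never see an arbitrary
// stride on a dimension they cannot step over anyway.
std::vector<std::int64_t> normalize_strides(
    const std::vector<std::int64_t>& shape,
    std::vector<std::int64_t> strides) {
  for (std::size_t i = 0; i < shape.size(); ++i) {
    if (shape[i] < 2) {
      strides[i] = 1;
    }
  }
  return strides;
}

int main() {
  // A [1, 3] view can legitimately carry stride 0 (or anything) on dim 0.
  auto s = normalize_strides({1, 3}, {0, 1});
  assert(s[0] == 1 && s[1] == 1);
  return 0;
}
```

Exporting a canonical stride for such dimensions is why the hunk builds a separate `view` and points the DLPack shape/stride/data fields at it rather than at `src` directly.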
- nullptr, - /*resizeable=*/false + GetAllocator(kMeta), + /*resizeable=*/true ), alias_(Alias(value)) {} diff --git a/aten/src/ATen/NestedTensorImpl.cpp b/aten/src/ATen/NestedTensorImpl.cpp index 077e9e742fc77..1e98d5ad7a957 100644 --- a/aten/src/ATen/NestedTensorImpl.cpp +++ b/aten/src/ATen/NestedTensorImpl.cpp @@ -4,7 +4,26 @@ #include #include #include +#include +#include +namespace { +inline void validate_nested_tensor_metadata( + const at::Tensor& nested_sizes, + const at::Tensor& nested_strides, + const std::vector& offsets) { + TORCH_INTERNAL_ASSERT(nested_sizes.is_contiguous()); + int64_t size_dim = nested_sizes.dim(); + TORCH_INTERNAL_ASSERT(size_dim == 0 || size_dim == 2); + TORCH_INTERNAL_ASSERT(nested_strides.is_contiguous()); + TORCH_INTERNAL_ASSERT(nested_strides.dim() == size_dim); + TORCH_INTERNAL_ASSERT(nested_sizes.sizes() == nested_strides.sizes()); + TORCH_INTERNAL_ASSERT( + (size_dim == 0 && (int64_t)offsets.empty()) || + (size_dim == 2 && nested_sizes.size(0) == (int64_t)offsets.size())); +} + +} // namespace namespace at { namespace native { @@ -99,10 +118,7 @@ inline std::vector construct_offsets(const at::Tensor& sizes) { // correct Autograd key which is AutogradNestedTensor c10::DispatchKeySet generate_nested_key_set(at::Tensor buffer) { c10::DispatchKeySet key_set = - (c10::DispatchKeySet(DispatchKey::NestedTensor) | - c10::DispatchKeySet( - buffer.is_cuda() ? BackendComponent::CUDABit - : BackendComponent::CPUBit)); + c10::DispatchKeySet(DispatchKey::NestedTensor) | c10::DispatchKeySet{buffer.key_set().highestBackendKey()}; // Add AutogradNestedTensor specific keys key_set = key_set | inplace_or_view_ks | autograd_nested; @@ -110,36 +126,50 @@ c10::DispatchKeySet generate_nested_key_set(at::Tensor buffer) { } NestedTensorImpl::NestedTensorImpl( - at::Tensor buffer, + Storage storage, + c10::DispatchKeySet key_set, + const caffe2::TypeMeta data_type, at::Tensor nested_size_tensor, at::Tensor nested_stride_tensor, - const std::vector& offsets) - : TensorImpl( - generate_nested_key_set(buffer), - buffer.dtype(), - buffer.device()), - buffer_(std::move(buffer)), + std::vector&& offsets) + : TensorImpl(std::move(storage), key_set, data_type), nested_size_tensor_(std::move(nested_size_tensor)), nested_stride_tensor_(std::move(nested_stride_tensor)), - offsets_(offsets), - opt_sizes_(construct_opt_sizes(nested_size_tensor_)) -{ + offsets_(std::move(offsets)), + opt_sizes_(construct_opt_sizes(nested_size_tensor_)) { TORCH_WARN_ONCE( "The PyTorch API of nested tensors is in prototype stage and will change " "in the near future."); - TORCH_INTERNAL_ASSERT(buffer_.is_cuda() || buffer_.is_cpu(), "NestedTensorImpl buffer must be either CUDA or CPU but got ", buffer_); - TORCH_INTERNAL_ASSERT(nested_size_tensor_.is_contiguous()); - int64_t size_dim = nested_size_tensor_.dim(); - TORCH_INTERNAL_ASSERT(size_dim == 0 || size_dim == 2); - TORCH_INTERNAL_ASSERT(nested_stride_tensor_.is_contiguous()); - TORCH_INTERNAL_ASSERT(nested_stride_tensor_.dim() == size_dim); - TORCH_INTERNAL_ASSERT(nested_stride_tensor_.sizes() == nested_size_tensor_.sizes()); - TORCH_INTERNAL_ASSERT((size_dim == 0 && (int64_t)offsets_.empty()) - || (size_dim == 2 && nested_size_tensor_.size(0) == (int64_t)offsets_.size())); + auto storage_device = storage_.device(); + TORCH_INTERNAL_ASSERT( + storage_device.is_cpu() || storage_device.is_cuda(), + "NestedTensorImpl storage must be either CUDA or CPU but got ", + storage_device); + validate_nested_tensor_metadata(nested_size_tensor_, nested_stride_tensor_, 
offsets_); refresh_dim(); set_sizes_strides_policy(c10::TensorImpl::SizesStridesPolicy::CustomSizes); } +NestedTensorImpl::NestedTensorImpl( + at::Tensor buffer, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + std::vector&& offsets) + : NestedTensorImpl( + buffer.storage(), + generate_nested_key_set(buffer), + buffer.dtype(), + nested_size_tensor, + nested_stride_tensor, + std::move(offsets)) { + + TORCH_INTERNAL_ASSERT( + buffer.dim() == 1, + "NestedTensorImpl buffer is required to be 1 dimensional but got a buffer with ", + buffer.dim(), + " dimensions."); +} + // assume contiguous, `nested_stride_tensor` and `offsets` // can be infered from `nested_size_tensor` NestedTensorImpl::NestedTensorImpl( @@ -152,6 +182,23 @@ NestedTensorImpl::NestedTensorImpl( construct_offsets(nested_size_tensor)) {} +NestedTensorImpl::NestedTensorImpl( + c10::TensorImpl::ImplType impl_type, + const at::Tensor& base_tensor, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + std::vector&& offsets) + : TensorImpl(impl_type, Storage(base_tensor.storage()), base_tensor.key_set(), base_tensor.dtype()), + nested_size_tensor_(std::move(nested_size_tensor)), + nested_stride_tensor_(std::move(nested_stride_tensor)), + offsets_(std::move(offsets)), + opt_sizes_(construct_opt_sizes(nested_size_tensor_)) { + TORCH_INTERNAL_ASSERT(base_tensor.is_nested()); + validate_nested_tensor_metadata(nested_size_tensor_, nested_stride_tensor_, offsets_); + refresh_dim(); + set_sizes_strides_policy(c10::TensorImpl::SizesStridesPolicy::CustomSizes); +} + void NestedTensorImpl::refresh_dim() { const auto my_dim = nested_size_tensor_.dim() ? nested_size_tensor_.sizes()[1] + 1 : 1; sizes_and_strides_.resize(my_dim); @@ -187,8 +234,13 @@ int64_t NestedTensorImpl::numel_custom() const { return static_cast(num_elements); } + +c10::SymInt NestedTensorImpl::sym_numel_custom() const { + return NestedTensorImpl::numel_custom(); +} + bool NestedTensorImpl::is_contiguous_custom(MemoryFormat) const { - TORCH_CHECK(false, "is_contiguous is disabled."); + return nested_tensor_impl_is_contiguous(this); } IntArrayRef NestedTensorImpl::sizes_custom() const { TORCH_CHECK(false, "Internal error: NestedTensorImpl doesn't support sizes. Please file an issue on https://github.com/pytorch/nestedtensor"); @@ -200,6 +252,9 @@ c10::SymIntArrayRef NestedTensorImpl::sym_sizes_custom() const { c10::SymIntArrayRef NestedTensorImpl::sym_sizes() const { return sym_sizes_custom(); } +c10::SymIntArrayRef NestedTensorImpl::sym_strides_custom() const { + TORCH_CHECK(false, "Internal error: NestedTensorImpl doesn't support strides. Please file an issue on https://github.com/pytorch/nestedtensor"); +} IntArrayRef NestedTensorImpl::strides_custom() const { TORCH_CHECK(false, "Internal error: NestedTensorImpl doesn't support strides. 
Please file an issue on https://github.com/pytorch/nestedtensor"); @@ -209,5 +264,51 @@ const char* NestedTensorImpl::tensorimpl_type_name() const { return "NestedTensorImpl"; } + +template +c10::intrusive_ptr NestedTensorImpl::shallow_copy_and_detach_core( + VariableVersion&& version_counter, + bool allow_tensor_metadata_change) const { + if (key_set_.has(DispatchKey::Python) && + !c10::impl::tls_is_dispatch_key_excluded(DispatchKey::Python)) { + auto r = pyobj_interpreter_.load(std::memory_order_acquire)->detach(this); + if (r) { + r->set_version_counter(std::forward(version_counter)); + r->set_allow_tensor_metadata_change(allow_tensor_metadata_change); + return r; + } + // otherwise just copy the TensorImpl and not the PyObject. Since + // the interpreter is dead no one can call us out on it + } + auto impl = c10::make_intrusive( + storage_, + key_set_, + data_type_, + nested_size_tensor_, + nested_stride_tensor_, + std::vector(offsets_)); + + copy_tensor_metadata( + /*src_impl=*/this, + /*dest_impl=*/impl.get(), + /*version_counter=*/std::forward(version_counter), + /*allow_tensor_metadata_change=*/allow_tensor_metadata_change); + return impl; +} + +c10::intrusive_ptr NestedTensorImpl::shallow_copy_and_detach( + const c10::VariableVersion& version_counter, + bool allow_tensor_metadata_change) const { + return shallow_copy_and_detach_core( + version_counter, allow_tensor_metadata_change); +} + +c10::intrusive_ptr NestedTensorImpl::shallow_copy_and_detach( + c10::VariableVersion&& version_counter, + bool allow_tensor_metadata_change) const { + return shallow_copy_and_detach_core( + std::move(version_counter), allow_tensor_metadata_change); +} + } // namespace native } // namespace at diff --git a/aten/src/ATen/NestedTensorImpl.h b/aten/src/ATen/NestedTensorImpl.h index 47f6c1516b9d5..f1fb8273c2902 100644 --- a/aten/src/ATen/NestedTensorImpl.h +++ b/aten/src/ATen/NestedTensorImpl.h @@ -1,8 +1,11 @@ #pragma once #include #include +#include +#include #include #include +#include #include #include #include @@ -11,15 +14,31 @@ namespace at { namespace native { struct TORCH_API NestedTensorImpl : public c10::TensorImpl { + explicit NestedTensorImpl( + Storage storage, + c10::DispatchKeySet key_set, + const caffe2::TypeMeta data_type, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + std::vector&& offsets); + explicit NestedTensorImpl( at::Tensor buffer, at::Tensor nested_size_tensor, at::Tensor nested_stride_tensor, - const std::vector& offsets); + std::vector&& offsets); // assume contiguous, `nested_stride_tensor` and `offsets` // can be infered from `nested_size_tensor` explicit NestedTensorImpl(at::Tensor buffer, at::Tensor nested_size_tensor); + // This constructor is used creating view tensors from nested tensors + explicit NestedTensorImpl( + c10::TensorImpl::ImplType impl_type, + const at::Tensor& base_tensor, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + std::vector&& offsets); + // TODO: don't expose private implementation details like this; in // particular, resizing this tensor will mess up our dim() and // callers cannot fix it. @@ -53,9 +72,25 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl { " is irregular and does not have a size."); return *optional_size; } + /** + * Return a view of the nested tensor as a 1 dimensional contiguous tensor. + * + * The buffer tensor created by this function shares the same storage_impl as + * the original nested tensor, and therefore can be seen as a view. 
+ * + * @return A newly constructed view tensor + */ + at::Tensor get_buffer() const { + auto buffer_key_set_ = generate_buffer_key_set(); + const auto buffer_size = get_buffer_size(); + auto buffer_tensor_impl = c10::make_intrusive( + c10::TensorImpl::VIEW, Storage(storage_), buffer_key_set_, data_type_); + buffer_tensor_impl->set_sizes_contiguous(c10::makeArrayRef(buffer_size)); + return Tensor(buffer_tensor_impl); + } - const at::Tensor& get_buffer() const { - return buffer_; + int64_t get_buffer_size() const { + return storage_.nbytes() / data_type_.itemsize(); } protected: @@ -64,6 +99,7 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl { // TODO: numel_custom and is_contiguous_custom can be profitably overridden // with real implementations int64_t numel_custom() const override; + c10::SymInt sym_numel_custom() const override; bool is_contiguous_custom(MemoryFormat) const override; int64_t size_custom(int64_t d) const override { return this->size(d); @@ -75,16 +111,32 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl { c10::SymIntArrayRef sym_sizes_custom() const override; c10::SymIntArrayRef sym_sizes() const override; IntArrayRef strides_custom() const override; + c10::SymIntArrayRef sym_strides_custom() const override; // this one is real int64_t dim_custom() const override; + c10::intrusive_ptr shallow_copy_and_detach( + const c10::VariableVersion& version_counter, + bool allow_tensor_metadata_change) const override; + + c10::intrusive_ptr shallow_copy_and_detach( + c10::VariableVersion&& version_counter, + bool allow_tensor_metadata_change) const override; + + void shallow_copy_from(const c10::intrusive_ptr& impl) override { + copy_tensor_metadata( + /*src_impl=*/impl.get(), + /*dest_impl=*/this, + /*version_counter=*/version_counter(), + /*allow_tensor_metadata_change=*/allow_tensor_metadata_change()); + } + private: // Must be called after any changes to our dim() to sync the state // to TensorImpl. void refresh_dim(); - at::Tensor buffer_; const at::Tensor nested_size_tensor_, nested_stride_tensor_; // The starting positions of the underlying tensors in contiguous buffer // i.e. the buffer memory offsets to get the underlying tensors @@ -103,6 +155,38 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl { // TODO: maybe we can remove this metadata since // we can compute it from `nested_size_tensor_` std::vector opt_sizes_; + + template + c10::intrusive_ptr shallow_copy_and_detach_core( + VariableVersion&& version_counter, + bool allow_tensor_metadata_change) const; + + /** + * Generates a non-nested key_set from a nested tensor. + * + * For many nested tensor kernel implementations a buffer tensor + * is generated and redispatched to a non-nested kernel this function + * generates the key set used by that buffer tensor + * + * @return A newly constructed view tensor + */ + inline c10::DispatchKeySet generate_buffer_key_set() const { + auto buffer_key_set = this->key_set(); + const bool Autograd = buffer_key_set.has_any(c10::autograd_dispatch_keyset); + // Remove nested tensor specific keys + buffer_key_set = buffer_key_set - + c10::DispatchKeySet{ + c10::DispatchKey::NestedTensor, + c10::DispatchKey::AutogradNestedTensor}; + + // Add dense tensor specific keys + buffer_key_set = + buffer_key_set | c10::DispatchKeySet{c10::DispatchKey::Dense}; + buffer_key_set = Autograd + ? 
c10::DispatchKeySet{c10::DispatchKey::Autograd} | buffer_key_set + : buffer_key_set; + return buffer_key_set; + } }; inline NestedTensorImpl* get_nested_tensor_impl_or_null( diff --git a/aten/src/ATen/Parallel.h b/aten/src/ATen/Parallel.h index 6c99fcd422cb6..4693997624e98 100644 --- a/aten/src/ATen/Parallel.h +++ b/aten/src/ATen/Parallel.h @@ -2,6 +2,7 @@ #include #include #include +#include namespace at { diff --git a/aten/src/ATen/SparseCsrTensorImpl.cpp b/aten/src/ATen/SparseCsrTensorImpl.cpp index dab45065fa71e..69fc013211f96 100644 --- a/aten/src/ATen/SparseCsrTensorImpl.cpp +++ b/aten/src/ATen/SparseCsrTensorImpl.cpp @@ -160,6 +160,9 @@ void SparseCsrTensorImpl::set_member_tensors( IntArrayRef SparseCsrTensorImpl::strides_custom() const { TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have strides"); } +SymIntArrayRef SparseCsrTensorImpl::sym_strides_custom() const { + TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have strides"); +} void SparseCsrTensorImpl::set_size(int64_t dim, int64_t new_size) { TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have set_size."); } diff --git a/aten/src/ATen/SparseCsrTensorImpl.h b/aten/src/ATen/SparseCsrTensorImpl.h index 878c465962b86..1f84fb422fde9 100644 --- a/aten/src/ATen/SparseCsrTensorImpl.h +++ b/aten/src/ATen/SparseCsrTensorImpl.h @@ -76,6 +76,7 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl { protected: IntArrayRef strides_custom() const override; + SymIntArrayRef sym_strides_custom() const override; public: void set_size(int64_t dim, int64_t new_size) override; diff --git a/aten/src/ATen/TensorIterator.h b/aten/src/ATen/TensorIterator.h index fdf86cbba6afe..59f52d9dbd2ed 100644 --- a/aten/src/ATen/TensorIterator.h +++ b/aten/src/ATen/TensorIterator.h @@ -473,6 +473,10 @@ struct TORCH_API TensorIteratorBase : public impl::MetaBase { } bool has_contiguous_first_dim() const { + if (ndim() == 0) { + return true; + } + int num_tensors = ntensors(); for (const auto i : c10::irange(num_tensors)) { if (strides(i)[0] != element_size(i)) { diff --git a/aten/src/ATen/TensorMeta.h b/aten/src/ATen/TensorMeta.h index 97124611ca13d..07631c3552fd2 100644 --- a/aten/src/ATen/TensorMeta.h +++ b/aten/src/ATen/TensorMeta.h @@ -71,6 +71,7 @@ namespace impl { struct TORCH_API MetaBase { virtual const Tensor& maybe_get_output(int64_t output_idx) = 0; + // Note: [set_output_*] // See: https://github.com/pytorch/pytorch/issues/69813 // Whenever defining the output properties in the META function of a // structured kernel (what was usually done with `set_output`), use one of diff --git a/aten/src/ATen/TensorSubclassLikeUtils.h b/aten/src/ATen/TensorSubclassLikeUtils.h index 5c01ce9790407..b533b49a9ca4d 100644 --- a/aten/src/ATen/TensorSubclassLikeUtils.h +++ b/aten/src/ATen/TensorSubclassLikeUtils.h @@ -1,5 +1,6 @@ #pragma once #include +#include namespace at { @@ -39,16 +40,22 @@ constexpr auto kTensorSubclassLike = DispatchKeySet(BackendComponent::MetaBit); inline bool isTensorSubclassLike(const Tensor& tensor) { + if (c10::impl::dispatch_mode_enabled()) + return true; auto key_set = tensor.unsafeGetTensorImpl()->key_set(); return !(key_set & kTensorSubclassLike).empty(); } inline bool areAnyTensorSubclassLike(TensorList tensors) { + if (c10::impl::dispatch_mode_enabled()) + return true; return std::any_of(tensors.begin(), tensors.end(), isTensorSubclassLike); } inline bool 
areAnyOptionalTensorSubclassLike( const c10::List>& tensors) { + if (c10::impl::dispatch_mode_enabled()) + return true; return std::any_of( tensors.begin(), tensors.end(), [](const optional& opt_tensor) { return ( @@ -56,4 +63,16 @@ inline bool areAnyOptionalTensorSubclassLike( }); } +// Helper function to deal testing truthfulness of a scalar tensor +// in a Composite Compliant manner. +// NOTE: This function expects a scalar tensor of boolean dtype. +// Eg. +// Non-Composite Compliant Pattern : (t == 0).all().item() +// Composite Compliant Patter : is_salar_tensor_true((t == 0).all()) +inline bool is_scalar_tensor_true(const Tensor& t) { + TORCH_INTERNAL_ASSERT(t.dim() == 0) + TORCH_INTERNAL_ASSERT(t.scalar_type() == kBool) + return at::equal(t, t.new_ones({}, t.options())); +} + } // namespace at diff --git a/aten/src/ATen/ThreadLocalState.cpp b/aten/src/ATen/ThreadLocalState.cpp index 8315ddad97b20..fb589beaba894 100644 --- a/aten/src/ATen/ThreadLocalState.cpp +++ b/aten/src/ATen/ThreadLocalState.cpp @@ -19,7 +19,7 @@ ThreadLocalState::ThreadLocalState() saved_tensors_default_hooks_ = at::SavedTensorDefaultHooks::get_stack(); - torch_dispatch_mode_state_ = at::impl::TorchDispatchModeTLS::get_state(); + torch_dispatch_mode_state_ = c10::impl::TorchDispatchModeTLS::get_state(); } void ThreadLocalState::set_grad_mode(bool enabled) { @@ -33,7 +33,7 @@ void ThreadLocalState::setThreadLocalState( // restore the dispatch key set TLS at the same time. c10::AutogradState::set_tls_state(state.autograd_tls_); - at::impl::TorchDispatchModeTLS::set_state(state.torch_dispatch_mode_state_); + c10::impl::TorchDispatchModeTLS::set_state(state.torch_dispatch_mode_state_); at::impl::PythonTorchFunctionTLS::set_state(state.python_torch_function_state_); diff --git a/aten/src/ATen/ThreadLocalState.h b/aten/src/ATen/ThreadLocalState.h index a21ee6a674f3c..a0067fb8aaebe 100644 --- a/aten/src/ATen/ThreadLocalState.h +++ b/aten/src/ATen/ThreadLocalState.h @@ -9,8 +9,8 @@ #include #include -#include #include +#include namespace at { diff --git a/aten/src/ATen/Utils.h b/aten/src/ATen/Utils.h index bbc235182f1e2..61c9c58fa437a 100644 --- a/aten/src/ATen/Utils.h +++ b/aten/src/ATen/Utils.h @@ -26,59 +26,6 @@ namespace at { TORCH_API int _crash_if_asan(int); -// TODO: This unwrapping code is ONLY used for TH bindings; once TH goes -// away, we can delete this function -static inline TensorImpl* checked_dense_tensor_unwrap( - const Tensor& expr, - const char* name, - int pos, - const char* api, - bool allowNull, - DeviceType device_type, - ScalarType scalar_type) { - if (allowNull && !expr.defined()) { - return nullptr; - } - if (expr.layout() != Layout::Strided) { - AT_ERROR( - "Expected dense tensor but got ", - expr.layout(), - " for argument #", - pos, - " '", - name, - "' in call to ", - api); - } - if (expr.device().type() != device_type) { - AT_ERROR( - "Expected object of device type ", - device_type, - " but got device type ", - expr.device().type(), - " for argument #", - pos, - " '", - name, - "' in call to ", - api); - } - if (expr.scalar_type() != scalar_type) { - AT_ERROR( - "Expected object of scalar type ", - scalar_type, - " but got scalar type ", - expr.scalar_type(), - " for argument #", - pos, - " '", - name, - "' in call to ", - api); - } - return expr.unsafeGetTensorImpl(); -} - // Converts a TensorList (i.e. ArrayRef to vector of TensorImpl*) // NB: This is ONLY used by legacy TH bindings, and ONLY used by cat. // Once cat is ported entirely to ATen this can be deleted! 
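A small libtorch program showing the composite-compliant pattern that `is_scalar_tensor_true` above encodes: stay inside dispatched tensor ops rather than pulling a host scalar out with `.item()`. This assumes a local libtorch build, and `scalar_tensor_true` is an illustrative stand-in, not the helper itself:

```
#include <torch/torch.h>
#include <iostream>

bool scalar_tensor_true(const torch::Tensor& t) {
  TORCH_CHECK(t.dim() == 0 && t.scalar_type() == torch::kBool);
  // Comparing against a freshly created scalar ones tensor keeps the check
  // inside the tensor API, so tensor subclasses and modes can observe it.
  return torch::equal(t, torch::ones({}, t.options()));
}

int main() {
  auto t = torch::zeros({4});
  std::cout << std::boolalpha << scalar_tensor_true((t == 0).all()) << "\n";  // true
  return 0;
}
```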
diff --git a/aten/src/ATen/autocast_mode.cpp b/aten/src/ATen/autocast_mode.cpp index da0a87b02d1d0..396b9746754cf 100644 --- a/aten/src/ATen/autocast_mode.cpp +++ b/aten/src/ATen/autocast_mode.cpp @@ -499,7 +499,6 @@ TORCH_LIBRARY_IMPL(aten, AutocastCPU, m) { KERNEL_CPU(ADD_NS(addbmm), "addbmm", Tensor (const Tensor &, const Tensor &, const Tensor &, const Scalar&, const Scalar&), lower_precision_fp) KERNEL_CPU(ADD_NS(linear), "linear", Tensor (const Tensor &, const Tensor &, const c10::optional &), lower_precision_fp) KERNEL_CPU(ADD_NS(_convolution), "_convolution.deprecated", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, bool, IntArrayRef, int64_t, bool, bool, bool), lower_precision_fp) - KERNEL_CPU(ADD_NS(_convolution), "_convolution", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, bool, IntArrayRef, int64_t, bool, bool, bool, bool), lower_precision_fp) KERNEL_CPU(ADD_NS(matmul), "matmul", Tensor (const Tensor &, const Tensor &), lower_precision_fp) KERNEL_CPU(ADD_NS(conv_tbc), "conv_tbc", Tensor(const Tensor &, const Tensor &, const Tensor &, int64_t), lower_precision_fp) @@ -545,10 +544,24 @@ TORCH_LIBRARY_IMPL(aten, AutocastCPU, m) { KERNEL_CPU(ADD_NS(replication_pad2d), "replication_pad2d", Tensor(const Tensor &, IntArrayRef), fp32) KERNEL_CPU(ADD_NS(replication_pad3d), "replication_pad3d", Tensor(const Tensor &, IntArrayRef), fp32) KERNEL_CPU(ADD_NS(mse_loss), "mse_loss", Tensor(const Tensor &, const Tensor &, int64_t), fp32) + KERNEL_CPU(ADD_NS(cosine_embedding_loss), "cosine_embedding_loss", Tensor (const Tensor &, const Tensor &, const Tensor &, double, int64_t), fp32) + KERNEL_CPU(ADD_NS(nll_loss), "nll_loss", Tensor (const Tensor &, const Tensor &, const c10::optional&, int64_t, int64_t), fp32) + KERNEL_CPU(ADD_NS(nll_loss2d), "nll_loss2d", Tensor (const Tensor &, const Tensor &, const c10::optional&, int64_t, int64_t), fp32) + KERNEL_CPU(ADD_NS(hinge_embedding_loss), "hinge_embedding_loss", Tensor (const Tensor &, const Tensor &, double, int64_t), fp32) + KERNEL_CPU(ADD_NS(poisson_nll_loss), "poisson_nll_loss", Tensor (const Tensor &, const Tensor &, bool, bool, double, int64_t), fp32) + KERNEL_CPU(ADD_NS(smooth_l1_loss), "smooth_l1_loss", Tensor (const Tensor &, const Tensor &, int64_t, double), fp32) + KERNEL_CPU(ADD_NS(cross_entropy_loss), "cross_entropy_loss", Tensor(const Tensor &, const Tensor &, const c10::optional &, int64_t, int64_t, double), fp32) + KERNEL_CPU(ADD_NS(l1_loss), "l1_loss", Tensor (const Tensor &, const Tensor &, int64_t), fp32) + KERNEL_CPU(ADD_NS(huber_loss), "huber_loss", Tensor (const Tensor &, const Tensor &, int64_t, double), fp32) + KERNEL_CPU(ADD_NS(margin_ranking_loss), "margin_ranking_loss", Tensor (const Tensor &, const Tensor &, const Tensor &, double, int64_t), fp32) + KERNEL_CPU(ADD_NS(soft_margin_loss), "soft_margin_loss", Tensor (const Tensor &, const Tensor &, int64_t), fp32) + KERNEL_CPU(ADD_NS(triplet_margin_loss), "triplet_margin_loss", Tensor (const Tensor &, const Tensor &, const Tensor &, double, double, double, bool, int64_t), fp32) + KERNEL_CPU(ADD_NS(multi_margin_loss), "multi_margin_loss", Tensor (const Tensor &, const Tensor &, const Scalar&, const Scalar&, const c10::optional&, int64_t), fp32) KERNEL_CPU(ADD_NS(ctc_loss), "ctc_loss.IntList", Tensor(const Tensor &, const Tensor &, IntArrayRef, IntArrayRef, int64_t, int64_t, bool), fp32) KERNEL_CPU(ADD_NS(ctc_loss), "ctc_loss.Tensor", Tensor(const Tensor &, const 
Tensor &, const Tensor &, const Tensor &, int64_t, int64_t, bool), fp32) KERNEL_CPU(ADD_NS(kl_div), "kl_div", Tensor(const Tensor &, const Tensor &, int64_t, bool), fp32) KERNEL_CPU(ADD_NS(multilabel_margin_loss), "multilabel_margin_loss", Tensor(const Tensor &, const Tensor &, int64_t), fp32) + KERNEL_CPU(ADD_NS(binary_cross_entropy_with_logits), "binary_cross_entropy_with_logits", Tensor (const Tensor &, const Tensor &, const c10::optional&, const c10::optional&, int64_t), fp32) KERNEL_CPU(ADD_NS(fft_fft), "fft_fft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) KERNEL_CPU(ADD_NS(fft_ifft), "fft_ifft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) KERNEL_CPU(ADD_NS(fft_fft2), "fft_fft2", Tensor(const Tensor &, at::OptionalIntArrayRef, at::IntArrayRef, c10::optional), fp32) diff --git a/aten/src/ATen/core/List.h b/aten/src/ATen/core/List.h index 0785a6941affd..fe75bf37cb7fa 100644 --- a/aten/src/ATen/core/List.h +++ b/aten/src/ATen/core/List.h @@ -69,7 +69,11 @@ struct ListElementConstReferenceTraits> { template class ListElementReference final { public: - operator T() const; + operator std::conditional_t< + std::is_reference::type>::value, + const T&, + T>() const; ListElementReference& operator=(T&& new_value) &&; @@ -106,9 +110,14 @@ class ListElementReference final { // this wraps vector::iterator to make sure user code can't rely // on it being the type of the underlying vector. -template -class ListIterator final : public std::iterator { -public: +template +class ListIterator final : public std::iterator< + std::random_access_iterator_tag, + T, + std::ptrdiff_t, + T*, + ListElementReference> { + public: explicit ListIterator() = default; ~ListIterator() = default; diff --git a/aten/src/ATen/core/List_inl.h b/aten/src/ATen/core/List_inl.h index 14b68b7360623..e8acb89bf3cb5 100644 --- a/aten/src/ATen/core/List_inl.h +++ b/aten/src/ATen/core/List_inl.h @@ -118,9 +118,13 @@ namespace detail { namespace impl { -template -ListElementReference::operator T() const { - return c10::detail::list_element_to(*iterator_); +template +ListElementReference::operator std::conditional_t< + std::is_reference::type>::value, + const T&, + T>() const { + return iterator_->template to(); } template diff --git a/aten/src/ATen/core/NamedRegistrations.cpp b/aten/src/ATen/core/NamedRegistrations.cpp index a9ae2f12c4dd0..bb675939b27c6 100644 --- a/aten/src/ATen/core/NamedRegistrations.cpp +++ b/aten/src/ATen/core/NamedRegistrations.cpp @@ -467,7 +467,6 @@ TORCH_LIBRARY_IMPL(aten, Named, m) { m.impl("sum.IntList_out", CppFunction::makeFallthrough()); m.impl("sum.dim_DimnameList", CppFunction::makeFallthrough()); m.impl("sum.dim_IntList", CppFunction::makeFallthrough()); - m.impl("sum.SymInt", CppFunction::makeFallthrough()); m.impl("t", CppFunction::makeFallthrough()); m.impl("tan", CppFunction::makeFallthrough()); m.impl("tan.out", CppFunction::makeFallthrough()); diff --git a/aten/src/ATen/core/PhiloxRNGEngine.h b/aten/src/ATen/core/PhiloxRNGEngine.h index d075d7dd6fbff..a702de8998d93 100644 --- a/aten/src/ATen/core/PhiloxRNGEngine.h +++ b/aten/src/ATen/core/PhiloxRNGEngine.h @@ -71,66 +71,67 @@ class philox_engine { C10_HOST_DEVICE inline explicit philox_engine(uint64_t seed = 67280421310721, uint64_t subsequence = 0, uint64_t offset = 0) { - key[0] = static_cast(seed); - key[1] = static_cast(seed >> 32); - counter = detail::UINT4(0); - counter[2] = static_cast(subsequence); - counter[3] = static_cast(subsequence >> 32); - STATE = 0; + + reset_state(seed, 
subsequence); incr_n(offset); } + C10_HOST_DEVICE inline void reset_state(uint64_t seed = 67280421310721, + uint64_t subsequence = 0) { + key_[0] = static_cast(seed); + key_[1] = static_cast(seed >> 32); + counter_ = detail::UINT4(0); + counter_[2] = static_cast(subsequence); + counter_[3] = static_cast(subsequence >> 32); + STATE = 0; + } + /** - * Produces a unique 32-bit pseudo random number on every invocation + * Produces a unique 32-bit pseudo random number on every invocation. Bookeeps state to avoid waste. */ - C10_HOST_DEVICE inline uint32_t operator()() { + C10_HOST_DEVICE inline uint32_t operator()(int32_t n_rounds = 10) { // 10 here to preserve back-compat behavior if(STATE == 0) { - detail::UINT4 counter_ = counter; - detail::UINT2 key_ = key; - - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - counter_ = single_round(counter_, key_); - key_[0] += (kPhilox10A); key_[1] += (kPhilox10B); - - output = single_round(counter_, key_); + detail::UINT4 counter = counter_; + detail::UINT2 key = key_; + output_ = rand(counter, key, n_rounds); incr(); } - uint32_t ret = output[STATE]; + uint32_t ret = output_[STATE]; STATE = (STATE + 1) & 3; return ret; } + inline float randn(uint32_t n_rounds) { + #ifdef __CUDA_ARCH__ + AT_ASSERT(false, "Unsupported invocation of randn on CUDA"); + #endif + reset_state(); // Reset state for randn - a little wasteful, but easier to ensure correctness. + detail::UINT4 counter = counter_; + detail::UINT2 key = key_; + detail::UINT4 i = rand(counter, key, n_rounds); + detail::FLOAT2 prenorm; + prenorm[0] = 1 - uint32_to_uniform_float(i[0]); // uint32_to_uniform_float returns [0,1), we need (0,1] to avoid passing 0 to log. 
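As a standalone sanity check of the mapping relied on here (same constant as the `uint32_to_uniform_float` helper in this hunk; the function is re-declared below only for illustration): 31 random bits are scaled into [0, 1), and the caller flips the value to (0, 1] before taking a log.

```
#include <cassert>
#include <cmath>
#include <cstdint>

// Chosen so that INT32_MAX * kScale stays below 1.0f after float rounding.
constexpr float kScale = 4.6566127342e-10f;

float uint32_to_uniform_float(std::uint32_t value) {
  return static_cast<float>(value & 0x7FFFFFFF) * kScale;
}

int main() {
  float lo = uint32_to_uniform_float(0);            // 0.0f, allowed in [0, 1)
  float hi = uint32_to_uniform_float(0xFFFFFFFFu);  // just below 1.0f
  assert(lo == 0.0f && hi < 1.0f);
  float shifted = 1.0f - lo;                        // in (0, 1]: safe input for log
  assert(shifted > 0.0f && std::isfinite(std::log(shifted)));
  return 0;
}
```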
+ prenorm[1] = 1 - uint32_to_uniform_float(i[1]); + detail::FLOAT2 ret = normalize_pair_uniform(prenorm); + return ret[0]; + } + /** * Function that Skips N 128 bit numbers in a subsequence */ C10_HOST_DEVICE inline void incr_n(uint64_t n) { uint32_t nlo = static_cast(n); uint32_t nhi = static_cast(n >> 32); - counter[0] += nlo; + counter_[0] += nlo; // if overflow in x has occurred, carry over to nhi - if (counter[0] < nlo) { + if (counter_[0] < nlo) { nhi++; // if overflow in nhi has occurred during carry over, // propagate that overflow to y and exit to increment z // otherwise return - counter[1] += nhi; + counter_[1] += nhi; if(nhi != 0) { - if (nhi <= counter[1]) { + if (nhi <= counter_[1]) { return; } } @@ -138,34 +139,34 @@ class philox_engine { // if overflow in y has occurred during addition, // exit to increment z // otherwise return - counter[1] += nhi; - if (nhi <= counter[1]) { + counter_[1] += nhi; + if (nhi <= counter_[1]) { return; } } - if (++counter[2]) + if (++counter_[2]) return; - ++counter[3]; + ++counter_[3]; } /** * Function that Skips one 128 bit number in a subsequence */ C10_HOST_DEVICE inline void incr() { - if (++counter[0]) + if (++counter_[0]) return; - if (++counter[1]) + if (++counter_[1]) return; - if (++counter[2]) { + if (++counter_[2]) { return; } - ++counter[3]; + ++counter_[3]; } private: - detail::UINT4 counter; - detail::UINT4 output; - detail::UINT2 key; + detail::UINT4 counter_; + detail::UINT4 output_; + detail::UINT2 key_; uint32_t STATE; C10_HOST_DEVICE inline uint32_t mulhilo32(uint32_t a, uint32_t b, @@ -192,12 +193,46 @@ class philox_engine { ret[3] = lo0; return ret; } + + C10_HOST_DEVICE constexpr float uint32_to_uniform_float(uint32_t value) { + // maximum value such that `MAX_INT * scale < 1.0` (with float rounding) + constexpr float scale = 4.6566127342e-10; + return static_cast(value & 0x7FFFFFFF) * scale; + } + + + + C10_HOST_DEVICE inline detail::UINT4 rand(detail::UINT4& counter, detail::UINT2& key, uint32_t n_rounds) { + for (uint32_t round = 0; round < (n_rounds - 1); round++) { + counter = single_round(counter, key); + key[0] += (kPhilox10A); key[1] += (kPhilox10B); + } + return single_round(counter, key); + } + + inline detail::FLOAT2 normalize_pair_uniform(detail::FLOAT2 in) { + // TODO(voz) We use std:: below, and thus need a separate impl for CUDA. 
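For reference, a standalone sketch of the textbook Box-Muller transform that `normalize_pair_uniform` in this hunk is patterned after; this is the standard form, in which both uniforms feed the transform, and it is not a copy of the helper itself:

```
#include <cassert>
#include <cmath>
#include <utility>

// Standard Box-Muller: two uniforms in (0, 1] map to two independent
// standard-normal samples.
std::pair<float, float> box_muller(float u1, float u2) {
  const float two_pi = 2.0f * static_cast<float>(M_PI);
  const float mag = std::sqrt(-2.0f * std::log(u1));
  return {mag * std::cos(two_pi * u2), mag * std::sin(two_pi * u2)};
}

int main() {
  auto [z0, z1] = box_muller(0.5f, 0.25f);
  assert(std::isfinite(z0) && std::isfinite(z1));
  return 0;
}
```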
+ float u1 = in[0]; + float u2 = in[1]; + + constexpr float two_pi = 2.0 * M_PI; + + float mag = std::sqrt(-2.0 * std::log(u1)); + + detail::FLOAT2 ret; + + ret[0] = mag * std::cos(two_pi); + ret[1] = mag * std::sin(two_pi); + return ret; + } + + static const uint32_t kPhilox10A = 0x9E3779B9; static const uint32_t kPhilox10B = 0xBB67AE85; static const uint32_t kPhiloxSA = 0xD2511F53; static const uint32_t kPhiloxSB = 0xCD9E8D57; }; -typedef philox_engine Philox4_32_10; +typedef philox_engine Philox4_32; } // namespace at diff --git a/aten/src/ATen/core/PythonFallbackKernel.cpp b/aten/src/ATen/core/PythonFallbackKernel.cpp index 37b46ae15a3c0..210aa6fa568fe 100644 --- a/aten/src/ATen/core/PythonFallbackKernel.cpp +++ b/aten/src/ATen/core/PythonFallbackKernel.cpp @@ -1,4 +1,4 @@ -#include +#include #include #include @@ -51,7 +51,7 @@ void pythonFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { // If Torch Dispatch Mode is active, use its PyInterpreter for dispatch - const auto& maybe_torch_dispatch_mode_state = at::impl::TorchDispatchModeTLS::get_state(); + const auto& maybe_torch_dispatch_mode_state = c10::impl::TorchDispatchModeTLS::get_state(); if (maybe_torch_dispatch_mode_state) { maybe_torch_dispatch_mode_state->pyinterpreter()->dispatch(op, stack); return; diff --git a/aten/src/ATen/core/PythonFallbackKernel.h b/aten/src/ATen/core/PythonFallbackKernel.h index 94cd4e81291a3..f38bdd2ada90a 100644 --- a/aten/src/ATen/core/PythonFallbackKernel.h +++ b/aten/src/ATen/core/PythonFallbackKernel.h @@ -1,5 +1,5 @@ #pragma once - +#include namespace at { namespace impl { diff --git a/aten/src/ATen/core/TensorBase.h b/aten/src/ATen/core/TensorBase.h index ca9c8b5f245a0..e6dd73658efc7 100644 --- a/aten/src/ATen/core/TensorBase.h +++ b/aten/src/ATen/core/TensorBase.h @@ -48,6 +48,7 @@ inline bool variable_excluded_from_dispatch() { return c10::impl::tls_local_dispatch_key_set().excluded_.isSupersetOf(c10::autograd_dispatch_keyset); #endif } + } // NOTE: [Tensor vs. TensorBase] @@ -161,6 +162,14 @@ class TORCH_API TensorBase { return impl_->sym_size(dim); } + c10::SymInt sym_stride(int64_t dim) const { + const auto sizes = this->sym_strides(); + const auto ndim = static_cast(sizes.size()); + // false is passed to maybe_wrap_dim so behavior is identical to array access (but with wrapping) + return sizes[c10::maybe_wrap_dim(dim, ndim, /*wrap_scalar=*/false)]; + + } + int64_t size(int64_t dim) const { return impl_->size(dim); } @@ -225,6 +234,9 @@ class TORCH_API TensorBase { c10::SymIntArrayRef sym_sizes() const { return impl_->sym_sizes(); } + c10::SymIntArrayRef sym_strides() const { + return impl_->sym_strides(); + } IntArrayRef strides() const { return impl_->strides(); } @@ -286,6 +298,10 @@ class TORCH_API TensorBase { return impl_->numel(); } + c10::SymInt sym_numel() const { + return impl_->sym_numel(); + } + // Length of one array element in bytes. This is the traditional // Numpy naming. 
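A small sketch of the negative-index wrapping that the new `sym_stride` accessor above leans on via `c10::maybe_wrap_dim`; the `wrap_dim` helper here is illustrative, not the c10 utility:

```
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// dim -1 maps to ndim - 1, and anything outside [-ndim, ndim) is rejected.
std::int64_t wrap_dim(std::int64_t dim, std::int64_t ndim) {
  if (dim < -ndim || dim >= ndim) {
    throw std::out_of_range("dimension out of range");
  }
  return dim < 0 ? dim + ndim : dim;
}

int main() {
  std::vector<std::int64_t> strides{12, 4, 1};  // e.g. a contiguous [2, 3, 4] tensor
  assert(strides[wrap_dim(-1, 3)] == 1);        // last dimension
  assert(strides[wrap_dim(0, 3)] == 12);        // first dimension
  return 0;
}
```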
size_t itemsize() const { diff --git a/aten/src/ATen/core/TorchDispatchModeTLS.cpp b/aten/src/ATen/core/TorchDispatchModeTLS.cpp deleted file mode 100644 index d224b08d5b54b..0000000000000 --- a/aten/src/ATen/core/TorchDispatchModeTLS.cpp +++ /dev/null @@ -1,58 +0,0 @@ -#include -#include -#include - -namespace at { namespace impl { - -thread_local std::shared_ptr torchDispatchModeState; - -void TorchDispatchModeTLS::set_state(std::shared_ptr state) { - if (state) { - c10::impl::tls_set_dispatch_key_included(DispatchKey::Python, true); - c10::impl::tls_set_dispatch_key_included(DispatchKey::PythonTLSSnapshot, true); - } else { - TorchDispatchModeTLS::reset_state(); - } - torchDispatchModeState = std::move(state); -} - -const std::shared_ptr& TorchDispatchModeTLS::get_state() { - return torchDispatchModeState; -} - -void TorchDispatchModeTLS::reset_state() { - torchDispatchModeState.reset(); - c10::impl::tls_set_dispatch_key_included(DispatchKey::Python, false); - c10::impl::tls_set_dispatch_key_included(DispatchKey::PythonTLSSnapshot, false); -} - -bool dispatch_mode_enabled() { - return static_cast(at::impl::TorchDispatchModeTLS::get_state()); -} - -bool tensor_has_dispatch(const at::Tensor& t) { - DispatchKeySet key_set({DispatchKey::Python, DispatchKey::PythonTLSSnapshot}); - return t.key_set().has_any(key_set); -} - -bool tensorlist_has_dispatch(const at::TensorList& li) { - for (const auto& t: li) { - if (tensor_has_dispatch(t)) { - return true; - } - } - return false; -} - -bool tensorlist_has_dispatch(const c10::List>& li) { - for (auto i : c10::irange(li.size())) { - auto t = li.get(i); - if (t && tensor_has_dispatch(*t)) { - return true; - } - } - return false; -} - -} // namespace impl -} // namespace at diff --git a/aten/src/ATen/core/TorchDispatchUtils.cpp b/aten/src/ATen/core/TorchDispatchUtils.cpp new file mode 100644 index 0000000000000..323019b3bbbb3 --- /dev/null +++ b/aten/src/ATen/core/TorchDispatchUtils.cpp @@ -0,0 +1,31 @@ +#include + +namespace at { +namespace impl { + +bool tensor_has_dispatch(const at::Tensor& t) { + DispatchKeySet key_set({DispatchKey::Python, DispatchKey::PythonTLSSnapshot}); + return t.key_set().has_any(key_set); +} + +bool tensorlist_has_dispatch(const at::TensorList& li) { + for (const auto& t: li) { + if (tensor_has_dispatch(t)) { + return true; + } + } + return false; +} + +bool tensorlist_has_dispatch(const c10::List>& li) { + for (auto i : c10::irange(li.size())) { + auto t = li.get(i); + if (t && tensor_has_dispatch(*t)) { + return true; + } + } + return false; +} + +} // namespace impl +} // namespace at diff --git a/aten/src/ATen/core/TorchDispatchModeTLS.h b/aten/src/ATen/core/TorchDispatchUtils.h similarity index 55% rename from aten/src/ATen/core/TorchDispatchModeTLS.h rename to aten/src/ATen/core/TorchDispatchUtils.h index 9ae015e6582f7..08c009c81b478 100644 --- a/aten/src/ATen/core/TorchDispatchModeTLS.h +++ b/aten/src/ATen/core/TorchDispatchUtils.h @@ -1,25 +1,17 @@ #pragma once -#include #include #include #include #include +#include namespace at { namespace impl { -struct TORCH_API TorchDispatchModeTLS { - static void set_state(std::shared_ptr state); - static const std::shared_ptr& get_state(); - static void reset_state(); -}; - -bool dispatch_mode_enabled(); bool tensor_has_dispatch(const at::Tensor& t); bool tensorlist_has_dispatch(const at::TensorList& li); bool tensorlist_has_dispatch(const c10::List>& li); +using c10::impl::dispatch_mode_enabled; - -} // namespace impl -} // namespace at +}} diff --git 
a/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h b/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h index d5345b28e7149..76082c5b01a4b 100644 --- a/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h +++ b/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h @@ -114,6 +114,9 @@ namespace detail { * they have been registered as fallthrough. The set of excluded backends * varies from operator, as some operators may have overridden the * fallthrough with custom behavior. + * + * Note - this should maintain identical impl to the py dispatcher key extraction logic + * at pytorch/torch/dispatcher.py */ struct TORCH_API DispatchKeyExtractor final { public: @@ -142,7 +145,7 @@ struct TORCH_API DispatchKeyExtractor final { // no safe toTensorRef method, alas) ks = ks | ivalue.unsafeToTensorImpl()->key_set(); } else if (C10_UNLIKELY(ivalue.isTensorList())) { - for (const at::Tensor tensor : ivalue.toTensorList()) { + for (const at::Tensor& tensor : ivalue.toTensorList()) { ks = ks | tensor.key_set(); } } diff --git a/aten/src/ATen/core/dispatch/Dispatcher.h b/aten/src/ATen/core/dispatch/Dispatcher.h index 83d1738da423b..bc40bc5b62e0c 100644 --- a/aten/src/ATen/core/dispatch/Dispatcher.h +++ b/aten/src/ATen/core/dispatch/Dispatcher.h @@ -162,6 +162,7 @@ class TORCH_API Dispatcher final { // Invoke an operator via the boxed calling convention using an IValue stack void callBoxed(const OperatorHandle& op, Stack* stack) const; + void callBoxedForDispatchKey(const OperatorHandle& op, DispatchKey dk, Stack* stack) const; // TODO: This will only be useful if we write a backend fallback that plumbs dispatch keys (currently there are none) // See Note [Plumbing Keys Through The Dispatcher] @@ -332,6 +333,9 @@ class TORCH_API OperatorHandle { return operatorDef_->op.hasKernelForDispatchKey(k); } + bool hasComputedKernelForDispatchKey(DispatchKey k) const { + return operatorDef_->op.hasComputedKernelForDispatchKey(k); + } std::string dumpComputedTable() const { return operatorDef_->op.dumpComputedTable(); @@ -376,6 +380,10 @@ class TORCH_API OperatorHandle { callBoxed(&stack); } + void callBoxedForDispatchKey(DispatchKey dk, Stack& stack) const { + c10::Dispatcher::singleton().callBoxedForDispatchKey(*this, dk, &stack); + } + void redispatchBoxed(DispatchKeySet ks, Stack* stack) const { c10::Dispatcher::singleton().redispatchBoxed(*this, ks, stack); } @@ -620,6 +628,18 @@ inline void Dispatcher::callBoxed(const OperatorHandle& op, Stack* stack) const kernel.callBoxed(op, dispatchKeySet, stack); } +// NB: this doesn't count as a "true" dispatcher jump, so no instrumentation +inline void Dispatcher::callBoxedForDispatchKey(const OperatorHandle& op, DispatchKey dk, Stack* stack) const { + // note: this doesn't need the mutex because write operations on the list keep iterators intact. + const auto& entry = op.operatorDef_->op; + // We still compute this as we're obligated to pass it on to the internal + // kernel, if it is a boxed fallback + auto dispatchKeySet = entry.dispatchKeyExtractor().getDispatchKeySetBoxed(stack); + const auto& kernel = entry.kernelForDispatchKey(dk); + kernel.callBoxed(op, dispatchKeySet, stack); +} + + inline void Dispatcher::redispatchBoxed(const OperatorHandle& op, DispatchKeySet dispatchKeySet, Stack* stack) const { // note: this doesn't need the mutex because write operations on the list keep iterators intact. 
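A toy model of the key extraction described in the DispatchKeyExtractor comment above, with made-up key names and a plain bitmask standing in for `DispatchKeySet`:

```
#include <cassert>
#include <cstdint>
#include <vector>

// Each tensor argument contributes a set of dispatch keys; the sets are OR-ed
// together, and keys registered as fallthrough for this operator are masked
// out before the highest-priority remaining key is chosen.
using KeySet = std::uint64_t;

KeySet extract(const std::vector<KeySet>& tensor_args, KeySet fallthrough_for_op) {
  KeySet ks = 0;
  for (KeySet arg : tensor_args) {
    ks |= arg;                      // union across all tensor arguments
  }
  return ks & ~fallthrough_for_op;  // drop keys this operator falls through
}

int main() {
  const KeySet kCPU = 1u << 0, kAutograd = 1u << 1, kPython = 1u << 2;
  // Two plain CPU tensors plus one tensor carrying a Python-mode key.
  KeySet ks = extract({kCPU, kCPU | kAutograd, kCPU | kPython}, /*fallthrough_for_op=*/kPython);
  assert(ks == (kCPU | kAutograd));
  return 0;
}
```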
const auto& entry = op.operatorDef_->op; diff --git a/aten/src/ATen/core/dispatch/OperatorEntry.cpp b/aten/src/ATen/core/dispatch/OperatorEntry.cpp index afcf552fdecda..5c1c42bb62260 100644 --- a/aten/src/ATen/core/dispatch/OperatorEntry.cpp +++ b/aten/src/ATen/core/dispatch/OperatorEntry.cpp @@ -198,10 +198,24 @@ bool OperatorEntry::hasKernelForAnyDispatchKey(DispatchKeySet ks) const { bool OperatorEntry::hasKernelForDispatchKey(DispatchKey k) const { TORCH_INTERNAL_ASSERT(kernels_.find(DispatchKey::Undefined) == kernels_.end()); - for (auto& kv : kernels_) { - if (k == kv.first) return true; - } - return false; + auto it = kernels_.find(k); + if (it == kernels_.end()) return false; + return it->second.size() > 0; +} + +const KernelFunction& OperatorEntry::kernelForDispatchKey(DispatchKey k) const { + auto it = kernels_.find(k); + TORCH_CHECK(it != kernels_.end() && it->second.size(), "no kernel for ", k, " on ", name_); + auto jt = it->second.begin(); + TORCH_INTERNAL_ASSERT(jt->kernel.isValid()) + return jt->kernel; +} + +bool OperatorEntry::hasComputedKernelForDispatchKey(DispatchKey k) const { + TORCH_CHECK(!isAliasDispatchKey(k), "Alias keys do not have runtime kernel registrations."); + const auto dispatch_ix = getDispatchTableIndexForDispatchKey(k); + TORCH_INTERNAL_ASSERT(dispatch_ix >= 0 && dispatch_ix < c10::num_runtime_entries, toString(k), dispatch_ix); + return dispatchTable_[dispatch_ix].isValid(); } const AnnotatedKernel* OperatorEntry::getKernelForDispatchKey(DispatchKey dispatch_key) const{ diff --git a/aten/src/ATen/core/dispatch/OperatorEntry.h b/aten/src/ATen/core/dispatch/OperatorEntry.h index 834f7b32f3947..1d9f1495f3c74 100644 --- a/aten/src/ATen/core/dispatch/OperatorEntry.h +++ b/aten/src/ATen/core/dispatch/OperatorEntry.h @@ -206,6 +206,12 @@ class TORCH_API OperatorEntry final { bool hasKernelForAnyDispatchKey(DispatchKeySet ks) const; // Returns true if kernel_ has entry for a particular key. bool hasKernelForDispatchKey(DispatchKey k) const; + // Retrieves the kernel entry at a particular key. Symmetric with + // hasKernelForDispatchKey. To get the AnnotatedKernel, see + // getKernelForDispatchKey (private) + const KernelFunction& kernelForDispatchKey(DispatchKey k) const; + // Returns true if the "computed table" has an entry for a particular key. + bool hasComputedKernelForDispatchKey(DispatchKey k) const; // Returns all the operator tags added at the time of registration const std::vector& getTags() const; diff --git a/aten/src/ATen/core/function_schema.h b/aten/src/ATen/core/function_schema.h index 77fdb20f6516a..16083820a1d81 100644 --- a/aten/src/ATen/core/function_schema.h +++ b/aten/src/ATen/core/function_schema.h @@ -550,15 +550,24 @@ inline std::ostream& operator<<(std::ostream& out, const Argument& arg) { bool is_opt = type->kind() == OptionalType::Kind; auto unopt_type = is_opt ? type->castRaw()->getElementType() : type; - if (unopt_type->kind() == ListType::Kind && arg.N()) { + if (unopt_type->kind() == ListType::Kind) { // sized lists get size N from arg, not type auto list = unopt_type->cast(); - out << list->getElementType()->str() << "[" << *arg.N() << "]"; + out << list->getElementType()->str(); + if (arg.alias_info() && !arg.alias_info()->containedTypes().empty()){ + out << arg.alias_info()->containedTypes()[0]; + } + std::string N = ""; + if (arg.N()) { + N = std::to_string(*arg.N()); + } + out << "[" << N << "]"; } else { out << unopt_type->str(); } - if (arg.alias_info()) { + // print alias info if it has beforeSets. 
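A condensed sketch of the sized-list printing rule introduced in the `function_schema.h` hunk above, with the argument machinery reduced to an element-type string and an optional static length; alias annotations are omitted:

```
#include <cassert>
#include <optional>
#include <sstream>
#include <string>

// A sized list prints its element type plus "[N]" when a static length is
// attached to the argument, and "[]" when it is not.
std::string print_list_type(const std::string& elem, std::optional<int> N) {
  std::ostringstream out;
  out << elem << "[" << (N ? std::to_string(*N) : "") << "]";
  return out.str();
}

int main() {
  assert(print_list_type("int", 3) == "int[3]");
  assert(print_list_type("int", std::nullopt) == "int[]");
  return 0;
}
```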
+ if (arg.alias_info() && !arg.alias_info()->beforeSets().empty()) { out << *arg.alias_info(); } diff --git a/aten/src/ATen/core/interned_strings.cpp b/aten/src/ATen/core/interned_strings.cpp index 0ad87c21c837f..ff9361f462a1a 100644 --- a/aten/src/ATen/core/interned_strings.cpp +++ b/aten/src/ATen/core/interned_strings.cpp @@ -141,6 +141,7 @@ bool Symbol::is_aten() const { return ns() == namespaces::aten; } bool Symbol::is_cuda() const { return ns() == namespaces::cuda; } bool Symbol::is_prim() const { return ns() == namespaces::prim; } bool Symbol::is_prims() const { return ns() == namespaces::prims; } +bool Symbol::is_nvprims() const { return ns() == namespaces::nvprims; } bool Symbol::is_onnx() const { return ns() == namespaces::onnx; } bool Symbol::is_user() const { return ns() == namespaces::user; } bool Symbol::is_caffe2() const { return ns() == namespaces::_caffe2; } diff --git a/aten/src/ATen/core/interned_strings.h b/aten/src/ATen/core/interned_strings.h index 8a195128b4d2c..0e16e8812a436 100644 --- a/aten/src/ATen/core/interned_strings.h +++ b/aten/src/ATen/core/interned_strings.h @@ -15,6 +15,7 @@ namespace c10 { #define FORALL_NS_SYMBOLS(_) \ _(namespaces, prim) \ _(namespaces, prims) \ + _(namespaces, nvprims) \ _(namespaces, aten) \ _(namespaces, cuda) \ _(namespaces, onnx) \ @@ -48,13 +49,16 @@ namespace c10 { _(prim, oneDNNFusionGuard) \ _(prim, FunctionalGraph) \ _(prim, add_optional) \ - _(prim, view_copy) \ + _(prim, expand_copy) \ + _(prim, expand_as_copy) \ + _(prim, flatten_copy) \ + _(prim, permute_copy) \ _(prim, reshape_copy) \ _(prim, squeeze_copy) \ + _(prim, t_copy) \ + _(prim, transpose_copy) \ _(prim, unsqueeze_copy) \ - _(prim, flatten_copy) \ - _(prim, expand_copy) \ - _(prim, expand_as_copy) \ + _(prim, view_copy) \ _(prim, DifferentiableGraph) \ _(prim, TensorExprGroup) \ _(prim, TensorExprDynamicGroup) \ @@ -221,6 +225,8 @@ namespace c10 { _(cuda, _current_device) \ _(cuda, synchronize) \ _(aten, has_torch_function) \ + _(aten, is_autocast_enabled) \ + _(aten, is_autocast_cpu_enabled) \ FORALL_ATEN_BASE_SYMBOLS(_) \ _(onnx, Add) \ _(onnx, Concat) \ diff --git a/aten/src/ATen/core/ivalue_inl.h b/aten/src/ATen/core/ivalue_inl.h index 301b448b834eb..00361c80a01cf 100644 --- a/aten/src/ATen/core/ivalue_inl.h +++ b/aten/src/ATen/core/ivalue_inl.h @@ -506,6 +506,7 @@ struct TORCH_API TupleElements { TORCH_CHECK(idx < inlineSize_, "TupleElements: invalid index Index = ", idx, "; Length = ", inlineSize_); return elementsInline_[idx]; } else { + TORCH_CHECK(idx < elementsVector_.size(), "TupleElements: invalid index Index = ", idx, "; Length = ", elementsVector_.size()); return elementsVector_.at(idx); } } diff --git a/aten/src/ATen/core/symbol.h b/aten/src/ATen/core/symbol.h index c06c261c3dd3c..04d480b51e317 100644 --- a/aten/src/ATen/core/symbol.h +++ b/aten/src/ATen/core/symbol.h @@ -82,6 +82,7 @@ struct TORCH_API Symbol { bool is_cuda() const; bool is_prim() const; bool is_prims() const; + bool is_nvprims() const; bool is_onnx() const; bool is_user() const; bool is_caffe2() const; diff --git a/aten/src/ATen/cpu/vec/vec256/vec256_qint.h b/aten/src/ATen/cpu/vec/vec256/vec256_qint.h index f92e1bd22811c..0ee43b53e6358 100644 --- a/aten/src/ATen/cpu/vec/vec256/vec256_qint.h +++ b/aten/src/ATen/cpu/vec/vec256/vec256_qint.h @@ -257,6 +257,19 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change 
the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return _mm256_loadu_si256((const __m256i*)tmp_values); + } + float_vec_return_type dequantize( Vectorized scale, Vectorized /*zero_point*/, @@ -417,10 +430,12 @@ struct Vectorized : public Vectorizedqi { // This is needed because the compiler emits awful code for the default // constructor for moving the enum // NOLINTNEXTLINE(clang-diagnostic-deprecated-copy) - #pragma clang diagnostic push - #pragma clang diagnostic ignored "-Wdeprecated-copy" + C10_CLANG_DIAGNOSTIC_PUSH() + #if C10_CLANG_HAS_WARNING("-Wdeprecated-copy") + C10_CLANG_DIAGNOSTIC_IGNORE("-Wdeprecated-copy") + #endif Vectorized(const Vectorized& other) : Vectorizedqi(other.vals) { } - #pragma clang diagnostic pop + C10_CLANG_DIAGNOSTIC_POP() void store(void* ptr, int count = size()) const { if (count != size()) { @@ -434,6 +449,19 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return _mm256_loadu_si256((const __m256i*)tmp_values); + } + private: __m256i cvtepi8_epi32(__m128i epi8_vals) const { return _mm256_cvtepi8_epi32(epi8_vals); @@ -580,10 +608,12 @@ struct Vectorized : public Vectorizedqi { } // NOLINTNEXTLINE(clang-diagnostic-deprecated-copy) - #pragma clang diagnostic push - #pragma clang diagnostic ignored "-Wdeprecated-copy" + C10_CLANG_DIAGNOSTIC_PUSH() + #if C10_CLANG_HAS_WARNING("-Wdeprecated-copy") + C10_CLANG_DIAGNOSTIC_IGNORE("-Wdeprecated-copy") + #endif Vectorized(const Vectorized& other) : Vectorizedqi(other.vals) { } - #pragma clang diagnostic pop + C10_CLANG_DIAGNOSTIC_POP() void store(void* ptr, int count = size()) const { if (count != size()) { @@ -597,6 +627,19 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. 
+ for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return _mm256_loadu_si256((const __m256i*)tmp_values); + } + private: __m256i cvtepu8_epi32(__m128i epu8_vals) const { return _mm256_cvtepu8_epi32(epu8_vals); @@ -816,6 +859,19 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return Vectorized(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, @@ -948,6 +1004,19 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return Vectorized(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, @@ -1068,6 +1137,19 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. 
+ for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return Vectorized(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, diff --git a/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h b/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h index 0267c40e1ea45..77cf3695ab912 100644 --- a/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h +++ b/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h @@ -551,27 +551,7 @@ class Vectorized { } Vectorized C10_ALWAYS_INLINE pow(const Vectorized& exp) const { - auto x = *this; - auto sign_bit = (*this) & sign_mask; - // |b| - auto exp_abs = exp.abs(); - auto exp_trunc = exp.trunc(); - Vectorized odd_mask; - odd_mask._vecb0 = (vec_signed(exp._vec0) & vi_1) != vi_0; - odd_mask._vecb1 = (vec_signed(exp._vec1) & vi_1) != vi_0; - // using ln fuction - auto temp = (abs().log() * exp).exp(); - - // is odd or even check from Sleef - auto is_int = (exp == exp_trunc) | (exp_abs >= vcheck); - auto is_odd = odd_mask & is_int & (exp_abs < vcheck); - // if even then then pow result should be absolute - auto temp_sign = temp | sign_bit; // copy_sign - auto out = blendv(temp, temp_sign, is_odd); - // x<0 and y != N, then NAN - auto out1 = blendv(out, v_nan, ((exp.floor() != exp) & (x < zero))); - // y = 0 then 1 - return blendv(out1, one, (exp_abs == zero)); + return {Sleef_powf4_u10vsx(_vec0, exp._vec0), Sleef_powf4_u10vsx(_vec1, exp._vec1)}; } Vectorized fmod(const Vectorized& b) const { diff --git a/aten/src/ATen/cpu/vec/vec512/vec512_qint.h b/aten/src/ATen/cpu/vec/vec512/vec512_qint.h index 0f3474eaa2ade..87cf44283c0be 100644 --- a/aten/src/ATen/cpu/vec/vec512/vec512_qint.h +++ b/aten/src/ATen/cpu/vec/vec512/vec512_qint.h @@ -268,6 +268,18 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + float_vec_return_type dequantize( Vectorized scale, Vectorized zero_point, @@ -447,6 +459,18 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. 
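The `loadu(ptr, count)` overloads added throughout these quantized `Vectorized` specializations all follow the same recipe: zero a full-width stack buffer with a loop, `memcpy` only `count` valid elements into it, then hand the buffer to the ordinary full-width load. A minimal stand-alone sketch of that pattern, using plain arrays instead of AVX registers and a hypothetical `partial_load` name:

```
#include <array>
#include <cstdint>
#include <cstring>

// Padded partial load: lanes past `count` are well-defined zeros rather than
// uninitialized memory, so full-width math on the result stays deterministic.
template <typename T, std::size_t N>
std::array<T, N> partial_load(const void* ptr, std::int64_t count) {
  std::array<T, N> tmp;
  // Zero with a loop, mirroring the codegen note in the comment above.
  for (std::size_t i = 0; i < N; ++i) {
    tmp[i] = T(0);
  }
  std::memcpy(tmp.data(), ptr, static_cast<std::size_t>(count) * sizeof(T));
  return tmp;  // stands in for the full-width vector load of the temp buffer
}

int main() {
  const std::int8_t src[3] = {1, 2, 3};
  const auto v = partial_load<std::int8_t, 32>(src, 3);  // lanes 3..31 are zero
  return (v[0] == 1 && v[31] == 0) ? 0 : 1;
}
```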
+ for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + private: __m512i cvtepi8_epi32(__m128i epi8_vals) const { return _mm512_cvtepi8_epi32(epi8_vals); @@ -611,6 +635,18 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + private: __m512i cvtepu8_epi32(__m128i epu8_vals) const { return _mm512_cvtepu8_epi32(epu8_vals); @@ -833,6 +869,18 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, @@ -965,6 +1013,18 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, @@ -1085,6 +1145,18 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. 
+ for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, diff --git a/aten/src/ATen/cpu/vec/vec_base.h b/aten/src/ATen/cpu/vec/vec_base.h index 3bf1010efd682..635ec8c82e5dc 100644 --- a/aten/src/ATen/cpu/vec/vec_base.h +++ b/aten/src/ATen/cpu/vec/vec_base.h @@ -33,6 +33,7 @@ #include #include #include +#include // These macros helped us unify vec_base.h #ifdef CPU_CAPABILITY_AVX512 @@ -975,7 +976,7 @@ inline void convert(const src_T *src, dst_T *dst, int64_t n) { #endif for (const auto i : c10::irange(n)) { (void)i; //Suppress unused variable warning - *dst = c10::static_cast_with_inter_type::apply(*src); + *dst = c10::convert(c10::load(src)); src++; dst++; } diff --git a/aten/src/ATen/cuda/CUDABlas.cpp b/aten/src/ATen/cuda/CUDABlas.cpp index e99017289d68b..866f53ee7f87f 100644 --- a/aten/src/ATen/cuda/CUDABlas.cpp +++ b/aten/src/ATen/cuda/CUDABlas.cpp @@ -1162,7 +1162,7 @@ void vdot>(CUDABLAS_DOT_ARGTYPES(c10::complex)) { reinterpret_cast(result))); } -// This guards blocks use of getrsBatched, geqrfBatched, getrfBatched, getriBatched on platforms other than cuda +// This guards blocks use of getrsBatched, geqrfBatched, getrfBatched on platforms other than cuda #ifdef CUDART_VERSION template <> @@ -1323,67 +1323,6 @@ void getrfBatched>( batchsize)); } -template <> -void getriBatched( - int n, double** dA_array, int ldda, int* ipiv_array, double** dC_array, int lddc, int* info_array, int batchsize) { - auto handle = at::cuda::getCurrentCUDABlasHandle(); - TORCH_CUDABLAS_CHECK(cublasDgetriBatched( - handle, n, dA_array, ldda, ipiv_array, dC_array, lddc, info_array, batchsize)); -} - -template <> -void getriBatched( - int n, float** dA_array, int ldda, int* ipiv_array, float** dC_array, int lddc, int* info_array, int batchsize) { - auto handle = at::cuda::getCurrentCUDABlasHandle(); - TORCH_CUDABLAS_CHECK(cublasSgetriBatched( - handle, n, dA_array, ldda, ipiv_array, dC_array, lddc, info_array, batchsize)); -} - -template <> -void getriBatched>( - int n, - c10::complex** dA_array, - int ldda, - int* ipiv_array, - c10::complex** dC_array, - int lddc, - int* info_array, - int batchsize) { - auto handle = at::cuda::getCurrentCUDABlasHandle(); - TORCH_CUDABLAS_CHECK(cublasZgetriBatched( - handle, - n, - reinterpret_cast(dA_array), - ldda, - ipiv_array, - reinterpret_cast(dC_array), - lddc, - info_array, - batchsize)); -} - -template <> -void getriBatched>( - int n, - c10::complex** dA_array, - int ldda, - int* ipiv_array, - c10::complex** dC_array, - int lddc, - int* info_array, - int batchsize) { - auto handle = at::cuda::getCurrentCUDABlasHandle(); - TORCH_CUDABLAS_CHECK(cublasCgetriBatched( - handle, - n, - reinterpret_cast(dA_array), - ldda, - ipiv_array, - reinterpret_cast(dC_array), - lddc, - info_array, - batchsize)); -} template <> void gelsBatched(CUDABLAS_GELS_BATCHED_ARGTYPES(double)) { diff --git a/aten/src/ATen/cuda/CUDABlas.h b/aten/src/ATen/cuda/CUDABlas.h index 10e589ecd6c9d..96c7fc8184228 100644 --- a/aten/src/ATen/cuda/CUDABlas.h +++ b/aten/src/ATen/cuda/CUDABlas.h @@ -227,7 +227,7 @@ void vdot>(CUDABLAS_DOT_ARGTYPES(c10::complex)); template <> void vdot>(CUDABLAS_DOT_ARGTYPES(c10::complex)); -// This guards blocks use of getrsBatched, geqrfBatched, getrfBatched, getriBatched on platforms other than cuda +// This guards blocks use of getrsBatched, geqrfBatched, getrfBatched on 
platforms other than cuda #ifdef CUDART_VERSION #define CUDABLAS_GETRS_ARGTYPES(Dtype) \ @@ -287,22 +287,6 @@ TORCH_CUDA_CU_API void getrfBatched>(CUDABLAS_GETRF_ARGTYPE template<> TORCH_CUDA_CU_API void getrfBatched>(CUDABLAS_GETRF_ARGTYPES(c10::complex)); -#define CUDABLAS_GETRI_ARGTYPES(Dtype) \ - int n, Dtype** dA_array, int ldda, int* ipiv_array, Dtype** dC_array, int lddc, int* info_array, int batchsize - -template -void getriBatched(CUDABLAS_GETRI_ARGTYPES(Dtype)) { - TORCH_CHECK(false, "at::cuda::blas::getriBatched: not implemented for ", typeid(Dtype).name()); -} -template<> -TORCH_CUDA_CU_API void getriBatched(CUDABLAS_GETRI_ARGTYPES(float)); -template<> -TORCH_CUDA_CU_API void getriBatched(CUDABLAS_GETRI_ARGTYPES(double)); -template<> -TORCH_CUDA_CU_API void getriBatched>(CUDABLAS_GETRI_ARGTYPES(c10::complex)); -template<> -TORCH_CUDA_CU_API void getriBatched>(CUDABLAS_GETRI_ARGTYPES(c10::complex)); - #define CUDABLAS_GELS_BATCHED_ARGTYPES(Dtype) \ cublasHandle_t handle, cublasOperation_t trans, int m, int n, int nrhs, Dtype** dA_array, int ldda, Dtype** dC_array, int lddc, int* info, int *devInfoArray, int batchSize diff --git a/aten/src/ATen/cuda/CUDAEvent.h b/aten/src/ATen/cuda/CUDAEvent.h index f07daeb979b9e..205fad8c11214 100644 --- a/aten/src/ATen/cuda/CUDAEvent.h +++ b/aten/src/ATen/cuda/CUDAEvent.h @@ -2,6 +2,7 @@ #include #include +#include #include #include #include @@ -45,6 +46,10 @@ struct TORCH_CUDA_CPP_API CUDAEvent { try { if (is_created_) { CUDAGuard guard(device_index_); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_deletion(reinterpret_cast(event_)); + } cudaEventDestroy(event_); } } catch (...) { /* No throw */ } @@ -113,6 +118,13 @@ struct TORCH_CUDA_CPP_API CUDAEvent { " does not match recording stream's device ", stream.device_index(), "."); CUDAGuard guard(device_index_); AT_CUDA_CHECK(cudaEventRecord(event_, stream)); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_record( + reinterpret_cast(event_), + reinterpret_cast(stream.stream()) + ); + } was_recorded_ = true; } @@ -122,6 +134,13 @@ struct TORCH_CUDA_CPP_API CUDAEvent { if (is_created_) { CUDAGuard guard(stream.device_index()); AT_CUDA_CHECK(cudaStreamWaitEvent(stream, event_, 0)); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_wait( + reinterpret_cast(event_), + reinterpret_cast(stream.stream()) + ); + } } } @@ -164,6 +183,10 @@ struct TORCH_CUDA_CPP_API CUDAEvent { device_index_ = device_index; CUDAGuard guard(device_index_); AT_CUDA_CHECK(cudaEventCreateWithFlags(&event_, flags_)); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_creation(reinterpret_cast(event_)); + } is_created_ = true; } diff --git a/aten/src/ATen/cuda/CUDASparse.h b/aten/src/ATen/cuda/CUDASparse.h index ecb7127dfa322..d309cd5d8e311 100644 --- a/aten/src/ATen/cuda/CUDASparse.h +++ b/aten/src/ATen/cuda/CUDASparse.h @@ -4,13 +4,26 @@ // cuSparse Generic API added in CUDA 10.1 // Windows support added in CUDA 11.0 -// ROCm is not enabled #if defined(CUDART_VERSION) && defined(CUSPARSE_VERSION) && ((CUSPARSE_VERSION >= 10300) || (CUSPARSE_VERSION >= 11000 && defined(_WIN32))) #define AT_USE_CUSPARSE_GENERIC_API() 1 #else #define AT_USE_CUSPARSE_GENERIC_API() 0 #endif +// hipSparse Generic API 
ROCm 5.2 +#if defined(USE_ROCM) && ROCM_VERSION >= 50200 +#define AT_USE_HIPSPARSE_GENERIC_52_API() 1 +#else +#define AT_USE_HIPSPARSE_GENERIC_52_API() 0 +#endif + +// hipSparse Generic API ROCm 5.1 +#if defined(USE_ROCM) && ROCM_VERSION >= 50100 +#define AT_USE_HIPSPARSE_GENERIC_API() 1 +#else +#define AT_USE_HIPSPARSE_GENERIC_API() 0 +#endif + // cuSparse Generic API spsv function was added in CUDA 11.3.0 #if defined(CUDART_VERSION) && defined(CUSPARSE_VERSION) && (CUSPARSE_VERSION >= 11500) #define AT_USE_CUSPARSE_GENERIC_SPSV() 1 diff --git a/aten/src/ATen/cuda/CUDASparseDescriptors.cpp b/aten/src/ATen/cuda/CUDASparseDescriptors.cpp index 3065babf89b6f..6319e214ac987 100644 --- a/aten/src/ATen/cuda/CUDASparseDescriptors.cpp +++ b/aten/src/ATen/cuda/CUDASparseDescriptors.cpp @@ -9,7 +9,7 @@ namespace at { namespace cuda { namespace sparse { -#if AT_USE_CUSPARSE_GENERIC_API() +#if AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API() namespace { @@ -53,6 +53,7 @@ cusparseIndexType_t getCuSparseIndexType(const c10::ScalarType& scalar_type) { } } +#if AT_USE_HIPSPARSE_GENERIC_52_API() || AT_USE_CUSPARSE_GENERIC_API() CuSparseDnMatDescriptor::CuSparseDnMatDescriptor(const Tensor& input, int64_t batch_offset) { TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input.layout() == kStrided); IntArrayRef input_strides = input.strides(); @@ -105,6 +106,7 @@ CuSparseDnMatDescriptor::CuSparseDnMatDescriptor(const Tensor& input, int64_t ba descriptor_.reset(raw_descriptor); } +#endif // AT_USE_HIPSPARSE_GENERIC_52_API() || AT_USE_CUSPARSE_GENERIC_API() CuSparseDnVecDescriptor::CuSparseDnVecDescriptor(const Tensor& input) { // cuSPARSE doesn't support batched vectors @@ -175,7 +177,7 @@ CuSparseSpMatCsrDescriptor::CuSparseSpMatCsrDescriptor(const Tensor& input, int6 value_type // data type of values )); -#if defined(CUDA_VERSION) && CUDA_VERSION >= 11000 +#if AT_USE_HIPSPARSE_GENERIC_52_API() || (defined(CUDA_VERSION) && CUDA_VERSION >= 11000) if (ndim == 3 && batch_offset == -1) { int batch_count = at::native::cuda_int_cast(at::native::batchCount(input), "batch_count"); @@ -204,7 +206,7 @@ CuSparseSpMatCsrDescriptor::CuSparseSpMatCsrDescriptor(const Tensor& input, int6 descriptor_.reset(raw_descriptor); } -#endif // AT_USE_CUSPARSE_GENERIC_API() +#endif // AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API() } // namespace sparse } // namespace cuda diff --git a/aten/src/ATen/cuda/CUDASparseDescriptors.h b/aten/src/ATen/cuda/CUDASparseDescriptors.h index 40078b65df647..60c9ff0ffa88a 100644 --- a/aten/src/ATen/cuda/CUDASparseDescriptors.h +++ b/aten/src/ATen/cuda/CUDASparseDescriptors.h @@ -40,6 +40,11 @@ class CuSparseDescriptor { #if defined(USE_ROCM) // hipSPARSE doesn't define this using cusparseMatDescr = std::remove_pointer::type; +using cusparseDnMatDescr = std::remove_pointer::type; +using cusparseDnVecDescr = std::remove_pointer::type; +using cusparseSpMatDescr = std::remove_pointer::type; +using cusparseSpMatDescr = std::remove_pointer::type; +using cusparseSpGEMMDescr = std::remove_pointer::type; #if AT_USE_HIPSPARSE_TRIANGULAR_SOLVE() using bsrsv2Info = std::remove_pointer::type; using bsrsm2Info = std::remove_pointer::type; @@ -92,15 +97,17 @@ class TORCH_CUDA_CPP_API CuSparseBsrsm2Info #endif // AT_USE_HIPSPARSE_TRIANGULAR_SOLVE -#if AT_USE_CUSPARSE_GENERIC_API() +#if AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API() cusparseIndexType_t getCuSparseIndexType(const c10::ScalarType& scalar_type); +#if AT_USE_HIPSPARSE_GENERIC_52_API() || AT_USE_CUSPARSE_GENERIC_API() 
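The `CuSparse*Descriptor` classes whose compilation is being gated above are thin RAII wrappers around C-style create/destroy handle APIs. A generic sketch of that wrapper shape follows; the `fake_*` API is hypothetical and only stands in for the cusparseCreate*/cusparseDestroy* pairs:

```
#include <memory>

// Hypothetical C-style handle API; these names are invented for the sketch.
struct fake_descr { int payload = 0; };
fake_descr* fake_create() { return new fake_descr(); }
void fake_destroy(fake_descr* d) { delete d; }

// RAII wrapper: the handle is released exactly once, when the C++ object dies.
class DescriptorWrapper {
 public:
  DescriptorWrapper() : descr_(fake_create(), &fake_destroy) {}
  fake_descr* raw() const { return descr_.get(); }

 private:
  std::unique_ptr<fake_descr, void (*)(fake_descr*)> descr_;
};

int main() {
  DescriptorWrapper d;
  return d.raw() != nullptr ? 0 : 1;
}
```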
class TORCH_CUDA_CPP_API CuSparseDnMatDescriptor : public CuSparseDescriptor { public: explicit CuSparseDnMatDescriptor(const Tensor& input, int64_t batch_offset = -1); }; +#endif //AT_USE_HIPSPARSE_GENERIC_52_API() || AT_USE_CUSPARSE_GENERIC_API() class TORCH_CUDA_CPP_API CuSparseDnVecDescriptor : public CuSparseDescriptor { @@ -116,7 +123,7 @@ class TORCH_CUDA_CPP_API CuSparseSpMatCsrDescriptor public: explicit CuSparseSpMatCsrDescriptor(const Tensor& input, int64_t batch_offset = -1); -#if defined(CUDA_VERSION) && CUDA_VERSION >= 11000 +#if defined(USE_ROCM) || (defined(CUDA_VERSION) && CUDA_VERSION >= 11000) std::tuple get_size() { int64_t rows, cols, nnz; TORCH_CUDASPARSE_CHECK(cusparseSpMatGetSize( @@ -190,7 +197,7 @@ class TORCH_CUDA_CPP_API CuSparseSpSMDescriptor }; #endif -#if defined(CUDA_VERSION) && CUDA_VERSION >= 11000 +#if (defined(USE_ROCM) && ROCM_VERSION >= 50200) || (defined(CUDA_VERSION) && CUDA_VERSION >= 11000) class TORCH_CUDA_CPP_API CuSparseSpGEMMDescriptor : public CuSparseDescriptor { public: @@ -202,7 +209,7 @@ class TORCH_CUDA_CPP_API CuSparseSpGEMMDescriptor }; #endif -#endif // AT_USE_CUSPARSE_GENERIC_API() +#endif // AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API() } // namespace sparse } // namespace cuda diff --git a/aten/src/ATen/cuda/jiterator.h b/aten/src/ATen/cuda/jiterator.h index 41a6f719a9e33..ac2c4d7cecf3f 100644 --- a/aten/src/ATen/cuda/jiterator.h +++ b/aten/src/ATen/cuda/jiterator.h @@ -33,7 +33,7 @@ TORCH_CUDA_CPP_API c10::SmallVector CompileAndLaunchKernel( const c10::SmallVector& tensors, const c10::SmallVector& extra_args, bool return_by_ref) { - TORCH_CHECK(false, "Jiterator is not supported on ROCm"); + TORCH_CHECK(false, "Jiterator is not supported"); } }} // namespace at::cuda diff --git a/aten/src/ATen/cuda/jiterator_impl.h b/aten/src/ATen/cuda/jiterator_impl.h index 7144b6d8eeaf9..5ba251055ad2a 100644 --- a/aten/src/ATen/cuda/jiterator_impl.h +++ b/aten/src/ATen/cuda/jiterator_impl.h @@ -27,6 +27,16 @@ namespace native { _(7) \ _(8) +#define AT_FOR_8_CASES_WITH_COMMA(_) \ + _(1) , \ + _(2) , \ + _(3) , \ + _(4) , \ + _(5) , \ + _(6) , \ + _(7) , \ + _(8) + c10::SmallVector get_extra_args_typenames(const c10::SmallVector& extra_args) { c10::SmallVector args_typenames(extra_args.size()); for (auto i = 0; i < extra_args.size(); ++i) { @@ -83,9 +93,9 @@ static std::unique_ptr> make_unique_offset_calculator( template struct OffsetCalculatorVariant { -#define DEFINE_CASE(index) std::unique_ptr>, +#define DEFINE_CASE(index) std::unique_ptr> using OffsetCalculatorTypes = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE @@ -113,9 +123,9 @@ struct OffsetCalculatorVariant { struct ArrayVariant { // works for up to 8 input + 8 outputs -#define DEFINE_CASE(index) at::detail::Array, at::detail::Array, +#define DEFINE_CASE(index) at::detail::Array, at::detail::Array using ArrayTypes = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE @@ -149,9 +159,9 @@ struct ArrayVariant { }; struct TrivialOffsetCalculatorVariant { -#define DEFINE_CASE(index) TrivialOffsetCalculator, +#define DEFINE_CASE(index) TrivialOffsetCalculator using TrivialOffsetCalculatorTypes = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE @@ -177,9 +187,9 @@ struct TrivialOffsetCalculatorVariant { }; struct LoadWithCastVariant { -#define DEFINE_CASE(index) std::unique_ptr>, +#define DEFINE_CASE(index) 
std::unique_ptr> using LoadWithCastPtr = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE @@ -206,9 +216,9 @@ struct LoadWithCastVariant { }; struct StoreWithCastVariant { -#define DEFINE_CASE(index) std::unique_ptr>, +#define DEFINE_CASE(index) std::unique_ptr> using StoreWithCastPtr = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE diff --git a/aten/src/ATen/cuda/llvm_complex.cpp b/aten/src/ATen/cuda/llvm_complex.cpp index 55e39e2802721..d88bdc4ce6579 100644 --- a/aten/src/ATen/cuda/llvm_complex.cpp +++ b/aten/src/ATen/cuda/llvm_complex.cpp @@ -834,7 +834,7 @@ complex::type> pow(const complex<_Tp>& __x, const complex<_Up>& __y) { typedef complex::type> result_type; - return _VSTD::pow(result_type(__x), result_type(__y)); + return std::pow(result_type(__x), result_type(__y)); } template @@ -847,7 +847,7 @@ typename enable_if pow(const complex<_Tp>& __x, const _Up& __y) { typedef complex::type> result_type; - return _VSTD::pow(result_type(__x), result_type(__y)); + return std::pow(result_type(__x), result_type(__y)); } template @@ -860,7 +860,7 @@ typename enable_if pow(const _Tp& __x, const complex<_Up>& __y) { typedef complex::type> result_type; - return _VSTD::pow(result_type(__x), result_type(__y)); + return std::pow(result_type(__x), result_type(__y)); } // __sqr, computes pow(x, 2) diff --git a/aten/src/ATen/jit_macros.h b/aten/src/ATen/jit_macros.h index ca765f03afbff..9af826549021a 100644 --- a/aten/src/ATen/jit_macros.h +++ b/aten/src/ATen/jit_macros.h @@ -3,12 +3,5 @@ #include // AT_USE_JITERATOR(), controls whether we jit some elementwise kernels -// Currently unsupported on ROCm GPUs -#if !AT_ROCM_ENABLED() #define AT_USE_JITERATOR() true #define jiterator_stringify(...) std::string(#__VA_ARGS__); -#else -#define AT_USE_JITERATOR() false -#define jiterator_stringify(...) \ - static_assert(false, "Jiterator is not supported on ROCm"); -#endif // USE_ROCM diff --git a/aten/src/ATen/jiterator_macros.h b/aten/src/ATen/jiterator_macros.h index 63a7dfa2eb967..3aa4c7ebb0af0 100644 --- a/aten/src/ATen/jiterator_macros.h +++ b/aten/src/ATen/jiterator_macros.h @@ -25,8 +25,8 @@ // These `,`s confuse the preprocessor into thinking we are passing // multiple arguments to the macro. #define jiterator_code(...) __VA_ARGS__ -#if defined(__CUDACC__) -// CPU and CUDA case +#if defined(__CUDACC__) || defined(__HIPCC__) +// CPU and CUDA and ROCm case #define stringify_code(...) 
#__VA_ARGS__ #define jiterator_also_stringify_as(code, str_name) \ code /* define the function */ \ diff --git a/aten/src/ATen/mps/IndexKernels.h b/aten/src/ATen/mps/IndexKernels.h new file mode 100644 index 0000000000000..b789cdc184161 --- /dev/null +++ b/aten/src/ATen/mps/IndexKernels.h @@ -0,0 +1,132 @@ +#pragma once + +namespace at { +namespace mps { + +static const char * indexing_metal_shaders = R"INDEX_METAL( +#include +using namespace metal; + +constant uint32_t num_indices [[function_constant(0)]]; + +struct IndexAB { + // Allow up to 16 indices + metal::array indexArray [[ id(0) ]]; +}; + +template +kernel void index_select( + constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]) { + + constant const int64_t * index_sizes = (constant const int64_t *)indexSizes; + constant const int64_t * index_strides = (constant const int64_t *)indexStrides; + int64_t offset = 0; + for (uint32_t i = 0; i < num_indices; i++) { + int64_t index = ((constant const int64_t*)(indexAB.indexArray[i]))[offsets[thread_index].z / sizeof(int64_t)]; + if (index < 0) { + index += index_sizes[i]; + } + offset += index * index_strides[i]; + } + device T * out = (device T*)((device char*)outputData + offsets[thread_index].x); + constant const T * in = (constant const T*)((constant const char*)inputData + offsets[thread_index].y + offset); + *out = *in; +} + +template +[[host_name("index_select_float")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); +template +[[host_name("index_select_half")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); +template +[[host_name("index_select_long")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); +template +[[host_name("index_select_int")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); +template +[[host_name("index_select_short")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets 
[[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); +template +[[host_name("index_select_char")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); +template +[[host_name("index_select_uchar")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); + +template +[[host_name("index_select_bool")]] +kernel void index_select(constant const IndexAB & indexAB [[buffer(0)]], + constant const void * indexSizes [[buffer(1)]], + constant const void * indexStrides [[buffer(2)]], + constant const uint3 * offsets [[buffer(3)]], + constant const void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); + +kernel void kernel_index_offsets(constant const packed_uint3 * strides [[buffer(0)]], + device uint3 * data_offsets [[buffer(1)]], + constant const uint * iter_shape [[buffer(2)]], + constant const uint & num_dimensions [[buffer(3)]], + constant const uint & num_offsets [[buffer(4)]], + uint thread_index [[thread_position_in_grid]]) { + uint32_t idx = thread_index; + for (uint32_t dim = 0; dim < num_dimensions; dim++) { + uint32_t remainder = idx % iter_shape[dim]; + idx /= iter_shape[dim]; + for (uint32_t offset = 0; offset < num_offsets; offset++) + data_offsets[thread_index][offset] += remainder * strides[dim][offset]; + } +} +)INDEX_METAL"; +} +} diff --git a/aten/src/ATen/mps/MPSAllocator.mm b/aten/src/ATen/mps/MPSAllocator.mm index 2433acbc050b2..28275c782b794 100644 --- a/aten/src/ATen/mps/MPSAllocator.mm +++ b/aten/src/ATen/mps/MPSAllocator.mm @@ -336,7 +336,7 @@ DataPtr allocate(const size_t nbytes) const override { DeleterFnPtr raw_deleter() const override { return &Delete; } bool is_shared(void* ptr) const { return _getAllocImpl().isSharedBuffer(ptr); } - bool is_shared_storge_supported() const { return m_has_unified_memory; } + bool is_shared_storage_supported() const { return m_has_unified_memory; } private: bool m_has_unified_memory; @@ -375,7 +375,7 @@ static bool isEnvVarEnabled(const char *envvar) { at::Allocator* getMPSSharedAllocator() { auto& sa = _getSharedAllocator(); - if (sa.is_shared_storge_supported()) { + if (sa.is_shared_storage_supported()) { return &sa; } diff --git a/aten/src/ATen/mps/MPSDevice.h b/aten/src/ATen/mps/MPSDevice.h index d957c5440a06d..77e93ea1234a4 100644 --- a/aten/src/ATen/mps/MPSDevice.h +++ b/aten/src/ATen/mps/MPSDevice.h @@ -11,9 +11,15 @@ #include #include typedef id MTLDevice_t; +typedef id MTLLibrary_t; +typedef id MTLFunction_t; +typedef MTLFunctionConstantValues* MTLFunctionConstantValues_t; #else typedef void* MTLDevice; typedef void* MTLDevice_t; +typedef void* MTLLibrary_t; +typedef void* MTLFunction_t; +typedef void* MTLFunctionConstantValues_t; #endif using namespace std; @@ -48,11 +54,14 @@ class TORCH_API MPSDevice { return _mtl_device; } + MTLFunction_t 
metalIndexingFunction(const std::string &kernel, MTLFunctionConstantValues_t constantValues); + ~MPSDevice(); private: static MPSDevice* _device; MTLDevice_t _mtl_device; + MTLLibrary_t _mtl_indexing_library; MPSDevice(); }; diff --git a/aten/src/ATen/mps/MPSDevice.mm b/aten/src/ATen/mps/MPSDevice.mm index 2775100666494..007dfbea54bf5 100644 --- a/aten/src/ATen/mps/MPSDevice.mm +++ b/aten/src/ATen/mps/MPSDevice.mm @@ -3,6 +3,7 @@ #include #include +#include namespace at { namespace mps { @@ -10,6 +11,15 @@ static std::unique_ptr mps_device; static c10::once_flag mpsdev_init; +static inline MTLLanguageVersion getMetalLanguageVersion(const id& device) { + // MPS Advanced Indexing needs at least Metal 2.0 (support for Argument Buffers and function constants) + // host_name attribute needs at least Metal 2.2 + MTLLanguageVersion languageVersion = MTLLanguageVersion2_2; + + TORCH_CHECK([device supportsFamily:MTLGPUFamilyMac2], "Missing Metal support for MTLGPUFamilyMac2"); + return languageVersion; +} + MPSDevice* MPSDevice::getInstance() { c10::call_once(mpsdev_init, [] { mps_device = std::unique_ptr(new MPSDevice()); @@ -17,12 +27,41 @@ return mps_device.get(); } +id MPSDevice::metalIndexingFunction(const std::string& kernel, MTLFunctionConstantValues* constantValues) { + TORCH_INTERNAL_ASSERT_DEBUG_ONLY(_mtl_device); + NSError* error = nil; + if (!_mtl_indexing_library) { + MTLCompileOptions *options = [MTLCompileOptions new]; + [options setLanguageVersion: getMetalLanguageVersion(_mtl_device)]; + [options setFastMathEnabled: YES]; + _mtl_indexing_library = [_mtl_device newLibraryWithSource: [NSString stringWithCString: mps::indexing_metal_shaders encoding:NSASCIIStringEncoding] + options: options + error: &error]; + TORCH_CHECK(_mtl_indexing_library, "Failed to create indexing library, error: ", [[error description] UTF8String]); + } + + id indexFunction = nil; + if (constantValues) { + indexFunction = [[_mtl_indexing_library newFunctionWithName: [NSString stringWithUTF8String: kernel.c_str()] + constantValues: constantValues + error: &error] autorelease]; + } else { + indexFunction = [[_mtl_indexing_library newFunctionWithName: [NSString stringWithUTF8String: kernel.c_str()]] autorelease]; + } + + TORCH_CHECK(indexFunction, "Failed to create specialized function state object: ", kernel, ", error: ", [[error description] UTF8String]); + + return indexFunction; +} + MPSDevice::~MPSDevice() { [_mtl_device release]; + [_mtl_indexing_library release]; _mtl_device = nil; + _mtl_indexing_library = nil; } -MPSDevice::MPSDevice(): _mtl_device(nil) { +MPSDevice::MPSDevice(): _mtl_device(nil), _mtl_indexing_library(nil) { // Check that MacOS 12.3+ version of MPS framework is available // Create the MPSGraph and check method introduced in 12.3+ // which is used by MPS backend. @@ -45,7 +84,7 @@ break; } } - assert(_mtl_device); + TORCH_INTERNAL_ASSERT_DEBUG_ONLY(_mtl_device); } at::Allocator* getMPSSharedAllocator(); diff --git a/aten/src/ATen/mps/MPSFallback.mm b/aten/src/ATen/mps/MPSFallback.mm index c4488330be75a..4f9e635dce05a 100644 --- a/aten/src/ATen/mps/MPSFallback.mm +++ b/aten/src/ATen/mps/MPSFallback.mm @@ -35,10 +35,6 @@ void mps_error_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) // These ops are not supported via MPS backend currently, and we fallback to run on CPU. // For the rest of unsupported ops the user needs to pass 'PYTORCH_ENABLE_MPS_FALLBACK=1' // to fallback on CPU, otherwise we will error out. 
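The `metalIndexingFunction` added above compiles the embedded Metal source at most once and then resolves kernels by name on every call. Setting the Objective-C++/Metal specifics aside, the control flow is the familiar lazily-initialized-resource pattern; a plain C++ sketch with made-up types (`Library`, `Function`, `compile_library`) rather than real Metal API:

```
#include <mutex>
#include <stdexcept>
#include <string>

// Made-up stand-ins for the Metal objects; not real Metal/MPS API.
struct Function { std::string name; };
struct Library {
  Function function_named(const std::string& name) const { return Function{name}; }
};
static Library compile_library(const std::string& /*source*/) { return Library{}; }

// Compile the shader source at most once, then resolve kernels by name on
// each call -- the same control flow as the function added above.
Function indexing_function(const std::string& kernel_name) {
  static std::once_flag once;
  static Library lib;
  std::call_once(once, [] { lib = compile_library("/* metal source */"); });
  if (kernel_name.empty()) {
    throw std::runtime_error("empty kernel name");
  }
  return lib.function_named(kernel_name);
}

int main() {
  return indexing_function("index_select_float").name.empty() ? 1 : 0;
}
```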
- m.impl("bitwise_and.Tensor_out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); - m.impl("bitwise_or.Tensor_out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); - m.impl("bitwise_xor.Tensor_out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); - m.impl("bitwise_not.out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); m.impl("bitwise_left_shift.Tensor_out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); m.impl("bitwise_right_shift.Tensor_out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); m.impl("embedding_renorm_", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); diff --git a/aten/src/ATen/mps/MPSGuardImpl.mm b/aten/src/ATen/mps/MPSGuardImpl.mm index c2987fdaa3e73..2aedeccf82cb9 100644 --- a/aten/src/ATen/mps/MPSGuardImpl.mm +++ b/aten/src/ATen/mps/MPSGuardImpl.mm @@ -9,9 +9,6 @@ void MPSGuardImpl::createEvent( mpsEvent_t* event, const EventFlag flag) const { - id mtl_device = MPSDevice::getInstance()->device(); - // when static casting we already create an _event object. - auto mps_event = static_cast(*event); } void MPSGuardImpl::destroyEvent( diff --git a/aten/src/ATen/native/AdaptiveAveragePooling.cpp b/aten/src/ATen/native/AdaptiveAveragePooling.cpp index cf4321a1d2d60..855d54eadba88 100644 --- a/aten/src/ATen/native/AdaptiveAveragePooling.cpp +++ b/aten/src/ATen/native/AdaptiveAveragePooling.cpp @@ -16,16 +16,16 @@ namespace { IntArrayRef output_size) { TORCH_CHECK(output_size.size() == 2, "adaptive_avg_pool2d: output_size must be 2"); - int64_t ndim = input.ndimension(); - for (const auto i : c10::irange(1, ndim)) { + int64_t ndim = input.dim(); + TORCH_CHECK((ndim == 3 || ndim == 4), + "adaptive_avg_pool2d(): Expected 3D or 4D tensor, but got ", input.sizes()); + for (const auto i : {-2, -1}) { TORCH_CHECK(input.size(i) > 0, "adaptive_avg_pool2d(): Expected input to have non-zero size for non-batch dimensions, " - "but input has sizes ", input.sizes(), " with dimension ", i, " being " + "but input has sizes ", input.sizes(), " with dimension ", i + ndim, " being " "empty"); } - TORCH_CHECK((ndim == 3 || ndim == 4), - "adaptive_avg_pool2d(): Expected 3D or 4D tensor, but got ", input.sizes()); TORCH_CHECK(input.dtype() == output.dtype(), "expected dtype ", input.dtype(), " for `output` but got dtype ", output.dtype()); diff --git a/aten/src/ATen/native/BatchLinearAlgebra.cpp b/aten/src/ATen/native/BatchLinearAlgebra.cpp index b2dc974f5a3b8..56c66171a9616 100644 --- a/aten/src/ATen/native/BatchLinearAlgebra.cpp +++ b/aten/src/ATen/native/BatchLinearAlgebra.cpp @@ -27,12 +27,6 @@ extern "C" void cgetrf_(int *m, int *n, std::complex *a, int *lda, int *i extern "C" void dgetrf_(int *m, int *n, double *a, int *lda, int *ipiv, int *info); extern "C" void sgetrf_(int *m, int *n, float *a, int *lda, int *ipiv, int *info); -// getri -extern "C" void zgetri_(int *n, std::complex *a, int *lda, int *ipiv, std::complex *work, int *lwork, int *info); -extern "C" void cgetri_(int *n, std::complex *a, int *lda, int *ipiv, std::complex *work, int *lwork, int *info); -extern "C" void dgetri_(int *n, double *a, int *lda, int *ipiv, double *work, int *lwork, int *info); -extern "C" void sgetri_(int *n, float *a, int *lda, int *ipiv, float *work, int *lwork, int *info); - // potrs extern "C" void zpotrs_(char *uplo, int *n, int *nrhs, std::complex *a, int *lda, std::complex *b, int *ldb, int *info); extern "C" void cpotrs_(char *uplo, int *n, int *nrhs, std::complex *a, int *lda, std::complex *b, int *ldb, 
int *info); @@ -454,6 +448,18 @@ TORCH_META_FUNC(_linalg_solve_ex)(const Tensor& A, set_output_contiguous(3, shape.slice(0, ndim - 2), A.options().dtype(kInt)); } +TORCH_META_FUNC(linalg_inv_ex)(const Tensor& A, bool check_errors) { + at::native::squareCheckInputs(A, "linalg.inv"); + at::native::checkFloatingOrComplex(A, "linalg.inv", /*allow_low_precision_dtypes*/false); + + auto shape = A.sizes(); + + auto result_strides = at::native::batched_matrix_contiguous_strides(shape, /*f-contig*=*/true); + set_output_strided(0, shape, result_strides, A.options(), {}); + set_output_contiguous( + 1, shape.slice(0, shape.size() - 2), A.options().dtype(ScalarType::Int)); // info +} + TORCH_META_FUNC(linalg_lu_factor_ex)(const Tensor& A, bool pivot, bool check_errors) { TORCH_CHECK(A.dim() >= 2, "torch.lu_factor: Expected tensor with 2 or more dimensions. Got size: ", A.sizes(), " instead"); @@ -682,31 +688,12 @@ namespace native { // Define the per-batch functions to be used in the main implementation of the batched // linear algebra operations -template -void lapackGetri(int n, scalar_t *a, int lda, int *ipiv, scalar_t *work, int lwork, int *info); - template void lapackCholeskySolve(char uplo, int n, int nrhs, scalar_t *a, int lda, scalar_t *b, int ldb, int *info); template void lapackSymeig(char jobz, char uplo, int n, scalar_t *a, int lda, value_t *w, scalar_t *work, int lwork, value_t *rwork, int *info); -template<> void lapackGetri>(int n, c10::complex *a, int lda, int *ipiv, c10::complex *work, int lwork, int *info) { - zgetri_(&n, reinterpret_cast*>(a), &lda, ipiv, reinterpret_cast*>(work), &lwork, info); -} - -template<> void lapackGetri>(int n, c10::complex *a, int lda, int *ipiv, c10::complex *work, int lwork, int *info) { - cgetri_(&n, reinterpret_cast*>(a), &lda, ipiv, reinterpret_cast*>(work), &lwork, info); -} - -template<> void lapackGetri(int n, double *a, int lda, int *ipiv, double *work, int lwork, int *info) { - dgetri_(&n, a, &lda, ipiv, work, &lwork, info); -} - -template<> void lapackGetri(int n, float *a, int lda, int *ipiv, float *work, int lwork, int *info) { - sgetri_(&n, a, &lda, ipiv, work, &lwork, info); -} - template<> void lapackLu>(int m, int n, c10::complex *a, int lda, int *ipiv, int *info) { zgetrf_(&m, &n, reinterpret_cast*>(a), &lda, ipiv, info); } @@ -1513,223 +1500,37 @@ bool _requires_fw_or_bw_grad(const Tensor& input) { || input._fw_grad(/*level */ 0).defined()); } -// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ inverse ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -/* -Computes the inverse of n-by-n matrix 'self' -This is an in-place routine, it overwrites the content of 'self'. -'infos_lu' and 'infos_getri' are int Tensors containing error codes for each matrix in the batched input. -'infos_lu' is for holding lapackLU errors, and 'infos_getri' is for holding lapackGetri errors. -For more information see LAPACK's documentation for GETRI and GETRF routines. 
-*/ -template -static void apply_inverse(Tensor& self, Tensor& infos_lu, Tensor& infos_getri) { -#if !AT_BUILD_WITH_LAPACK() - AT_ERROR("inverse: LAPACK library not found in compilation"); -#else - using value_t = typename c10::scalar_value_type::type; - auto self_data = self.data_ptr(); - auto self_matrix_stride = matrixStride(self); - auto batch_size = batchCount(self); - auto n = self.size(-2); - auto lda = std::max(1, n); - - auto ipiv = at::empty({lda}, self.options().dtype(kInt)); - auto ipiv_data = ipiv.data_ptr(); - auto infos_lu_data = infos_lu.data_ptr(); - auto infos_getri_data = infos_getri.data_ptr(); - - // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - int info; - // Run once, first to get the optimum work size - // Since we deal with batches of matrices with the same dimensions, doing this outside - // the loop saves (batch_size - 1) workspace queries which would provide the same result - // and (batch_size - 1) calls to allocate and deallocate workspace using at::empty() - int lwork = -1; - scalar_t wkopt; - lapackGetri(n, self_data, lda, ipiv_data, &wkopt, lwork, &info); - lwork = std::max(1, real_impl(wkopt)); - Tensor work = at::empty({lwork}, self.options()); - auto work_data = work.data_ptr(); - - for (const auto i : c10::irange(batch_size)) { - scalar_t* self_working_ptr = &self_data[i * self_matrix_stride]; - int* info_lu_working_ptr = &infos_lu_data[i]; - lapackLu(n, n, self_working_ptr, lda, ipiv_data, info_lu_working_ptr); - - // now compute the actual inverse - int* info_getri_working_ptr = &infos_getri_data[i]; - lapackGetri(n, self_working_ptr, lda, ipiv_data, work_data, lwork, info_getri_working_ptr); - } -#endif -} - -Tensor inverse(const Tensor &self) { - if (self.numel() == 0) { - return at::empty_like(self); - } - return at::linalg_inv(self); -} - -Tensor& inverse_out(const Tensor &self, Tensor &result) { - at::linalg_inv_out(result, self); - return result; -} - -// This is a type dispatching helper function for 'apply_inverse' -Tensor& _linalg_inv_out_helper_cpu(Tensor &result, Tensor& infos_lu, Tensor& infos_getri) { - // This function calculates the inverse matrix in-place - // result should be in column major order and contain matrices to invert - // the content of result is overwritten by 'apply_inverse' - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cpu", [&]{ - apply_inverse(result, infos_lu, infos_getri); - }); - return result; -} - -// Computes the inverse matrix of 'input', it is saved to 'result' in-place -// LAPACK/MAGMA/cuSOLVER error codes are saved in 'infos' tensors, they are not checked here -static Tensor& linalg_inv_out_info(Tensor& result, Tensor& infos_lu, Tensor& infos_getri, const Tensor& input) { - squareCheckInputs(input, "linalg.inv"); - checkSameDevice("linalg.inv", result, input); - checkLinalgCompatibleDtype("linalg.inv", result, input); - - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_lu.scalar_type() == kInt); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_getri.scalar_type() == kInt); - - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_lu.device() == input.device()); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_getri.device() == input.device()); - - bool result_input_same_type = (result.scalar_type() == input.scalar_type()); - bool result_equal_expected_shape = result.sizes().equals(input.sizes()); - bool is_batched_column_major = false; - if (result.dim() >= 2) { - is_batched_column_major = result.mT().is_contiguous(); - } - - // if result is not empty and not in batched column major format - bool copy_needed = 
(result.numel() != 0 && !is_batched_column_major); - copy_needed |= !result_input_same_type; // or result does not have the same dtype as input - copy_needed |= (result.numel() != 0 && !result_equal_expected_shape); // or result does not have the expected shape - // we have to allocate a temporary tensor - - // similar conditions for infos_lu and infos_getri tensors - auto expected_info_shape = IntArrayRef(input.sizes().cbegin(), input.sizes().cend() - 2); // input.shape[:-2] - copy_needed |= (infos_lu.numel() != 0 && !infos_lu.is_contiguous()); - copy_needed |= (infos_lu.numel() != 0 && !(infos_lu.sizes().equals(expected_info_shape))); - - copy_needed |= (infos_getri.numel() != 0 && !infos_getri.is_contiguous()); - copy_needed |= (infos_getri.numel() != 0 && !(infos_getri.sizes().equals(expected_info_shape))); - - if (copy_needed) { - Tensor result_tmp = at::empty(input.sizes(), input.options()); - result_tmp.transpose_(-2, -1); - Tensor infos_lu_tmp = at::zeros({expected_info_shape}, input.options().dtype(kInt)); - Tensor infos_getri_tmp = at::zeros({expected_info_shape}, input.options().dtype(kInt)); - - result_tmp = linalg_inv_out_info(result_tmp, infos_lu_tmp, infos_getri_tmp, input); - - at::native::resize_output(result, result_tmp.sizes()); - result.copy_(result_tmp); - at::native::resize_output(infos_lu, infos_lu_tmp.sizes()); - infos_lu.copy_(infos_lu_tmp); - at::native::resize_output(infos_getri, infos_getri_tmp.sizes()); - infos_getri.copy_(infos_getri_tmp); - return result; - } - // else use result's storage directly - - // if result has no elements we can modify it - if (result.numel() == 0) { - at::native::resize_as_(result, input.mT(), MemoryFormat::Contiguous); - result.transpose_(-2, -1); - } - - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.sizes().equals(input.sizes())); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.scalar_type() == input.scalar_type()); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.device() == input.device()); - - // result tensor must be in batched column major order (Fortran contiguous) - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.mT().is_contiguous()); - - // if info has no elements we can modify it - if (infos_lu.numel() == 0) { - infos_lu.resize_(expected_info_shape); - infos_lu.fill_(0); - } - if (infos_getri.numel() == 0) { - infos_getri.resize_(expected_info_shape); - infos_getri.fill_(0); +// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ linalg.inv ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +TORCH_IMPL_FUNC(linalg_inv_ex_out)(const Tensor& A, bool check_errors, const Tensor& result, const Tensor& info) { + // Fill result with the identity + result.zero_(); + result.diagonal(0, -2, -1).fill_(1.); + at::linalg_solve_ex_out(const_cast(result), const_cast(info), A, result, /*left*/true); + if (check_errors) { + at::_linalg_check_errors(info, "linalg.inv_ex", A.dim() == 2); } - - // info tensors must be contiguous - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_lu.is_contiguous()); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_lu.sizes().equals(expected_info_shape)); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_getri.is_contiguous()); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_getri.sizes().equals(expected_info_shape)); - - // _linalg_inv_out_helper_ (apply_inverse) performs calculations in-place and result must be a copy of input - result.copy_(input); - - // TODO: Replace this helper with DECLARE/DEFINE_DISPATCH - result = at::_linalg_inv_out_helper_(result, infos_lu, infos_getri); - return result; } -// Computes the inverse matrix of 'input', it is saved to 'result' in-place -Tensor& 
linalg_inv_out(const Tensor &input, Tensor &result) { - auto info_shape = IntArrayRef(input.sizes().cbegin(), input.sizes().cend() - 2); // input.shape[:-2] - auto infos_lu = at::zeros({info_shape}, input.options().dtype(kInt)); - auto infos_getri = at::zeros({info_shape}, input.options().dtype(kInt)); - result = linalg_inv_out_info(result, infos_lu, infos_getri, input); - - // Now check LAPACK/MAGMA/cuSOLVER error codes - at::_linalg_check_errors(infos_lu, "linalg.inv", result.dim() == 2); - at::_linalg_check_errors(infos_getri, "linalg.inv", result.dim() == 2); +Tensor& linalg_inv_out(const Tensor& A, Tensor& result) { + auto info = at::empty({0}, A.options().dtype(kInt)); + at::linalg_inv_ex_out(result, info, A); + at::_linalg_check_errors(info, "linalg.inv", A.dim() == 2); return result; } -// Computes the inverse matrix of 'input' -Tensor linalg_inv(const Tensor &input) { +Tensor linalg_inv(const Tensor& A) { Tensor result, info; - std::tie(result, info) = at::linalg_inv_ex(input, /*check_errors=*/false); - - // we pass check_errors=false above and do the check here - // so that the name of the function is correct in the error message - at::_linalg_check_errors(info, "torch.linalg.inv", input.dim() == 2); + std::tie(result, info) = at::linalg_inv_ex(A); + at::_linalg_check_errors(info, "linalg.inv", A.dim() == 2); return result; } -std::tuple linalg_inv_ex_out(const Tensor& input, bool check_errors, Tensor& inverse, Tensor& info) { - squareCheckInputs(input, "linalg.inv_ex"); - ScalarType info_output_type = ScalarType::Int; - TORCH_CHECK( - info.scalar_type() == info_output_type, - "torch.linalg.inv_ex: ", - "Expected info to have ", info_output_type, " dtype, but got info with dtype ", info.scalar_type()); - - // provided `info` tensor is used to save the information about the LU decomposition of `input` - // in addition current implementation requires a separate tensor - // for saving the information about the inversion process after the LU decomposition - auto expected_info_shape = IntArrayRef(input.sizes().cbegin(), input.sizes().cend() - 2); // input.shape[:-2] - auto info_inversion = at::zeros({expected_info_shape}, input.options().dtype(kInt)); - - linalg_inv_out_info(inverse, info, info_inversion, input); - - if (check_errors) { - at::_linalg_check_errors(info, "torch.linalg.inv_ex", input.dim() == 2); - } - - return std::tuple(inverse, info); +Tensor& inverse_out(const Tensor& A, Tensor& result) { + return at::linalg_inv_out(result, A); } -std::tuple linalg_inv_ex(const Tensor& input, bool check_errors) { - squareCheckInputs(input, "linalg.inv_ex"); - Tensor inverse = at::empty(input.sizes(), input.options(), MemoryFormat::Contiguous); - inverse.transpose_(-2, -1); // make `inverse` tensor with batched column major format - auto info_shape = IntArrayRef(input.sizes().cbegin(), input.sizes().cend() - 2); // input.shape[:-2] - Tensor info = at::zeros({info_shape}, input.options().dtype(kInt)); - std::tie(inverse, info) = at::native::linalg_inv_ex_out(input, check_errors, inverse, info); - return std::make_tuple(inverse, info); +Tensor inverse(const Tensor& A) { + return at::linalg_inv(A); } // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cholesky_solve ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -2001,6 +1802,7 @@ TORCH_IMPL_FUNC(_linalg_solve_ex_out)(const Tensor& A, // Possible optimization: Compute the LU factorization of A^T if A is contiguous // Then we solve A^T X = B with adjoint=True // This saves a copy as A doesn't need to be copied into an F-contig matrix in lu_factor + // This 
optimization makes functorch's batching rule difficult. See NOTE [ solve_ex Batch Rule Contiguity ] const bool use_A_T = A.is_contiguous() && !A.is_complex(); at::linalg_lu_factor_ex_out(const_cast(LU), const_cast(pivots), diff --git a/aten/src/ATen/native/Convolution.cpp b/aten/src/ATen/native/Convolution.cpp index 9f2d8efbd6181..7128b5c8aea1d 100644 --- a/aten/src/ATen/native/Convolution.cpp +++ b/aten/src/ATen/native/Convolution.cpp @@ -713,6 +713,7 @@ at::Tensor complex_convolution( IntArrayRef stride, IntArrayRef padding, IntArrayRef dilation, + bool transposed, IntArrayRef output_padding, int64_t groups) { check_input_same_type_as_parameters(input, weight, bias); @@ -730,15 +731,15 @@ at::Tensor complex_convolution( // conv(W, x, b) = a - b + i(c - a - b) Tensor a, b, c; if (!bias.defined()) { - a = at::convolution(i_r, w_r, bias, stride, padding, dilation, false, output_padding, groups); - b = at::convolution(i_i, w_i, bias, stride, padding, dilation, false, output_padding, groups); - c = at::convolution(i_r + i_i, w_r + w_i, bias, stride, padding, dilation, false, output_padding, groups); + a = at::convolution(i_r, w_r, bias, stride, padding, dilation, transposed, output_padding, groups); + b = at::convolution(i_i, w_i, bias, stride, padding, dilation, transposed, output_padding, groups); + c = at::convolution(i_r + i_i, w_r + w_i, bias, stride, padding, dilation, transposed, output_padding, groups); } else { Tensor b_r, b_i; std::tie(b_r, b_i) = complex_to_real(bias.resolve_conj()); - a = at::convolution(i_r, w_r, b_r, stride, padding, dilation, false, output_padding, groups); - b = at::convolution(i_i, w_i, Tensor(), stride, padding, dilation, false, output_padding, groups); - c = at::convolution(i_r + i_i, w_r + w_i, b_r + b_i, stride, padding, dilation, false, output_padding, groups); + a = at::convolution(i_r, w_r, b_r, stride, padding, dilation, transposed, output_padding, groups); + b = at::convolution(i_i, w_i, Tensor(), stride, padding, dilation, transposed, output_padding, groups); + c = at::convolution(i_r + i_i, w_r + w_i, b_r + b_i, stride, padding, dilation, transposed, output_padding, groups); } auto i = c10::Scalar(c10::complex(0, 1)); @@ -791,7 +792,7 @@ at::Tensor conv1d( std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 1, "conv1d"); Tensor output; if (at::isComplexType(input_.scalar_type())) { - output = complex_convolution(input, weight, bias, stride, padding, dilation, {0}, groups); + output = complex_convolution(input, weight, bias, stride, padding, dilation, false, {0}, groups); } else { output = at::convolution(input, weight, bias, stride, padding, dilation, false, {0}, groups); } @@ -810,7 +811,7 @@ at::Tensor conv2d( std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 2, "conv2d"); Tensor output; if (at::isComplexType(input_.scalar_type())) { - output = complex_convolution(input, weight, bias, stride, padding, dilation, {{0, 0}}, groups); + output = complex_convolution(input, weight, bias, stride, padding, dilation, false, {{0, 0}}, groups); } else { output = at::convolution(input, weight, bias, stride, padding, dilation, false, {{0, 0}}, groups); } @@ -829,7 +830,7 @@ at::Tensor conv3d( std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 3, "conv3d"); Tensor output; if (at::isComplexType(input_.scalar_type())) { - output = complex_convolution(input, weight, bias, stride, padding, dilation, {{0, 0, 0}}, groups); + output = complex_convolution(input, weight, bias, stride, padding, dilation, false, {{0, 
0, 0}}, groups); } else { output = at::convolution(input, weight, bias, stride, padding, dilation, false, {{0, 0, 0}}, groups); } @@ -979,8 +980,14 @@ at::Tensor conv_transpose1d( Tensor input; bool is_batched; std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 1, "conv_transpose1d"); - auto output = at::convolution( + Tensor output; + if (at::isComplexType(input_.scalar_type())) { + output = complex_convolution( input, weight, bias, stride, padding, dilation, true, output_padding, groups); + } else { + output = at::convolution( + input, weight, bias, stride, padding, dilation, true, output_padding, groups); + } return is_batched ? output : output.squeeze(0); } diff --git a/aten/src/ATen/native/Correlation.cpp b/aten/src/ATen/native/Correlation.cpp index 0bd27195df766..a00aeb4dc9257 100644 --- a/aten/src/ATen/native/Correlation.cpp +++ b/aten/src/ATen/native/Correlation.cpp @@ -1,5 +1,6 @@ #include #include +#include namespace at { namespace native { @@ -47,7 +48,7 @@ Tensor cov( " != ", num_observations); TORCH_CHECK( - num_observations == 0 || w.min().ge(0).item(), + num_observations == 0 || at::is_scalar_tensor_true(w.min().ge(0)), "cov(): fweights cannot be negative"); } @@ -70,7 +71,7 @@ Tensor cov( " != ", num_observations); TORCH_CHECK( - num_observations == 0 || aw.min().ge(0).item(), + num_observations == 0 || at::is_scalar_tensor_true(aw.min().ge(0)), "cov(): aweights cannot be negative"); w = w.defined() ? w * aw : aw; } @@ -81,7 +82,7 @@ Tensor cov( : at::scalar_tensor(num_observations, in.options().dtype(kLong)); TORCH_CHECK( - !w.defined() || w_sum.ne(0).item(), + !w.defined() || at::is_scalar_tensor_true(w_sum.ne(0)), "cov(): weights sum to zero, can't be normalized"); const auto avg = (w.defined() ? in * w : in).sum(OBSERVATIONS_DIM) / w_sum; @@ -95,7 +96,7 @@ Tensor cov( norm_factor = w_sum - correction; } - if (norm_factor.le(0).item()) { + if (at::is_scalar_tensor_true(norm_factor.le(0))) { TORCH_WARN("cov(): degrees of freedom is <= 0"); norm_factor.zero_(); } diff --git a/aten/src/ATen/native/Cross.cpp b/aten/src/ATen/native/Cross.cpp index 4b3e43da1147b..9b268c6e3d542 100644 --- a/aten/src/ATen/native/Cross.cpp +++ b/aten/src/ATen/native/Cross.cpp @@ -9,17 +9,20 @@ namespace at { namespace meta { -TORCH_PRECOMPUTE_META_FUNC(linalg_cross) -(const Tensor & input, const Tensor & other, const int64_t dimension) { - auto out_size = infer_size(input.sizes(), other.sizes()); - Tensor input_broadcasted = input.expand(out_size); - Tensor other_broadcasted = other.expand(out_size); +TORCH_META_FUNC(linalg_cross) +(const Tensor & input, const Tensor & other, int64_t dim) { + auto x_d = input.dim(); + auto y_d = other.dim(); + // This is to avoid things like + // linalg.cross(torch.randn(2, 3), torch.randn(5, 2, 3), dim=2) + TORCH_CHECK(x_d == y_d, "linalg.cross: inputs must have the same number of dimensions."); + TORCH_CHECK(input.size(dim) == 3 && other.size(dim) == 3, "linalg.cross: inputs dimension ", dim, " must have length 3. Got ", input.size(dim), " and ", other.size(dim)); - int64_t dim = maybe_wrap_dim(dimension, input.dim()); // default dim = -1 - TORCH_CHECK(input_broadcasted.size(dim) == 3, "dimension ", dimension, " does not have size 3"); + // Broadcast the batch dimension of input and other. 
+ // Since the non-batch dimensions agree, this is the same as broadcast all the inputs + auto out_size = infer_size(input.sizes(), other.sizes()); set_output_raw_strided(0, out_size, {}, input.options()); - return TORCH_PRECOMPUTE_STRUCT(linalg_cross)().set_dim(dim); } } @@ -56,8 +59,9 @@ Tensor & cross_out(const Tensor & input, const Tensor & other, const c10::option TORCH_IMPL_FUNC(linalg_cross_out) -(const Tensor & input, const Tensor & other, const int64_t dim, const Tensor & out) { - auto out_size = infer_size(input.sizes(), other.sizes()); +(const Tensor & input, const Tensor & other, int64_t dim, const Tensor & out) { + dim = maybe_wrap_dim(dim, input.dim()); + auto out_size = out.sizes(); Tensor input_broadcasted = input.expand(out_size); Tensor other_broadcasted = other.expand(out_size); diff --git a/aten/src/ATen/native/DispatchStub.h b/aten/src/ATen/native/DispatchStub.h index 6e71b5bb5881b..bcbf41fd9d0ff 100644 --- a/aten/src/ATen/native/DispatchStub.h +++ b/aten/src/ATen/native/DispatchStub.h @@ -114,6 +114,7 @@ struct TORCH_API DispatchStubImpl { std::atomic cpu_dispatch_ptr; void* cuda_dispatch_ptr; void* hip_dispatch_ptr; + void* mps_dispatch_ptr; #else std::atomic cpu_dispatch_ptr{nullptr}; void* cuda_dispatch_ptr = nullptr; diff --git a/aten/src/ATen/native/Dropout.cpp b/aten/src/ATen/native/Dropout.cpp index 36e1b92ad1bdb..2514f15d00e1a 100644 --- a/aten/src/ATen/native/Dropout.cpp +++ b/aten/src/ATen/native/Dropout.cpp @@ -109,7 +109,7 @@ native_dropout_cpu(const Tensor& input, double p, c10::optional train) { return std::make_tuple(output, mask); } -Tensor native_dropout_backward_cpu(const Tensor& grad, const Tensor& mask, double scale) { +Tensor native_dropout_backward(const Tensor& grad, const Tensor& mask, double scale) { Tensor result = grad * mask * scale; return result; } @@ -117,7 +117,10 @@ Tensor native_dropout_backward_cpu(const Tensor& grad, const Tensor& mask, doubl Tensor dropout(const Tensor& input, double p, bool train) { auto result = [&]() { NoNamesGuard guard; - if (train && is_fused_kernel_acceptable(input, p)) { + // TODO: we can remove this is_nested() code smell in the future + // if we find a way to support _dropout for nested tensor + // e.g. make it an op (at::_dropout) to use dispatcher? + if (input.is_nested() || (train && is_fused_kernel_acceptable(input, p))) { return std::get<0>(at::native_dropout(input, p, train)); } return _dropout(input, p, train); diff --git a/aten/src/ATen/native/ForeachOpsKernels.cpp b/aten/src/ATen/native/ForeachOpsKernels.cpp index 7d6dec7ad24a7..f5665be248e46 100644 --- a/aten/src/ATen/native/ForeachOpsKernels.cpp +++ b/aten/src/ATen/native/ForeachOpsKernels.cpp @@ -199,6 +199,9 @@ FOREACH_POINTWISE_OP_SCALAR(addcmul); FOREACH_POINTWISE_OP_SCALARLIST(addcdiv); FOREACH_POINTWISE_OP_SCALARLIST(addcmul); +// NOTE(crcrpar): It didn't seem feasible to use `self[i]` as both the first and the last +// arguments of `maximum_out` and `minimum_out` so I tentatively embarrassingly get and copy +// the result to `self[i]`. 
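For readers skimming the patch, an editorial sketch (not part of the change itself) of what the in-place slow path defined by the macro right below amounts to for `maximum`, per the NOTE above: compute out-of-place, then copy the result back. The function name is illustrative and the `check_foreach_api_restrictions` validation is omitted.

```
// Editorial sketch only: roughly what foreach_tensor_maximum_slow_ does.
#include <ATen/ATen.h>
#include <c10/util/irange.h>

void foreach_maximum_inplace_sketch(at::TensorList self, at::TensorList other) {
  for (const auto i : c10::irange(self.size())) {
    const auto tmp = at::maximum(self[i], other[i]);  // out-of-place result
    self[i].copy_(tmp, /*non_blocking=*/true);        // copy back into self[i]
  }
}
```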
#define FOREACH_MAXIMUM_MINIMUM_OP(NAME) \ std::vector foreach_tensor_##NAME##_slow(TensorList tensors1, TensorList tensors2) { \ check_foreach_api_restrictions(tensors1, tensors2); \ @@ -211,6 +214,13 @@ std::vector foreach_tensor_##NAME##_slow(TensorList tensors1, TensorList \ return result; \ } \ +void foreach_tensor_##NAME##_slow_(TensorList self, TensorList other) { \ + check_foreach_api_restrictions(self, other); \ + for (const auto i : c10::irange(self.size())) { \ + const auto tmp = at::NAME(self[i], other[i]); \ + self[i].copy_(tmp, /* non_blocking */ true); \ + } \ +} FOREACH_MAXIMUM_MINIMUM_OP(maximum) FOREACH_MAXIMUM_MINIMUM_OP(minimum) diff --git a/aten/src/ATen/native/Integration.cpp b/aten/src/ATen/native/Integration.cpp index d1795a70b7222..7ca01bae18a57 100644 --- a/aten/src/ATen/native/Integration.cpp +++ b/aten/src/ATen/native/Integration.cpp @@ -34,10 +34,10 @@ Tensor do_trapezoid(const Tensor& y, double dx, int64_t dim) { } Tensor zeros_like_except(const Tensor& y, int64_t dim) { - auto sizes = y.sizes().vec(); + auto sizes = y.sym_sizes().vec(); dim = maybe_wrap_dim(dim, y.dim()); sizes.erase(sizes.begin() + dim); - return at::zeros(sizes, y.options()); + return at::zeros_symint(sizes, y.options()); } Tensor do_cumulative_trapezoid(const Tensor& y, const Tensor& dx, int64_t dim) { @@ -111,7 +111,7 @@ Tensor trapezoid(const Tensor& y, const Tensor& x, int64_t dim) { Tensor trapezoid(const Tensor& y, const Scalar& dx, int64_t dim) { // see above - if (y.size(dim) == 0) { + if (y.sym_size(dim) == 0) { return zeros_like_except(y, dim); } TORCH_CHECK(y.scalar_type() != kBool, "trapezoid: received a bool input for `y`, but bool is not supported") diff --git a/aten/src/ATen/native/Linear.cpp b/aten/src/ATen/native/Linear.cpp index a002369fc547d..6137a2f0f153f 100644 --- a/aten/src/ATen/native/Linear.cpp +++ b/aten/src/ATen/native/Linear.cpp @@ -26,9 +26,6 @@ Tensor linear(const Tensor& input, const Tensor& weight, const c10::optional batch_norm_cpu_update_stats_template( auto _var_sum = at::empty({n_input}, input.options().dtype(dtype)); auto _mean_a = _mean.accessor(); auto _var_sum_a = _var_sum.accessor(); + auto momentum_ = static_cast(momentum); batch_norm_cpu_collect_stats_stub(kCPU, _mean, _var_sum, input); @@ -195,11 +196,11 @@ std::tuple batch_norm_cpu_update_stats_template( save_var_transform_a[f] = VarTransform{}(_var_sum_a[f] / n, eps); if (running_mean.defined()) { - running_mean_a[f] = momentum * _mean_a[f] + (1 - momentum) * running_mean_a[f]; + running_mean_a[f] = momentum_ * _mean_a[f] + (1 - momentum_) * running_mean_a[f]; } if (running_var.defined()) { - accscalar_t unbiased_var = _var_sum_a[f] / (n - 1); - running_var_a[f] = momentum * unbiased_var + (1 - momentum) * running_var_a[f]; + accscalar_t unbiased_var = _var_sum_a[f] / (n - 1); + running_var_a[f] = momentum_ * unbiased_var + (1 - momentum_) * running_var_a[f]; } } }); @@ -523,7 +524,7 @@ std::tuple _batch_norm_impl_index( && cudnn_enabled ); - if (use_miopen) { + if (use_miopen && input.suggest_memory_format() != MemoryFormat::ChannelsLast && input.suggest_memory_format() != MemoryFormat::ChannelsLast3d) { return std::tuple_cat( at::miopen_batch_norm( input.contiguous(), weight.contiguous(), bias.contiguous(), diff --git a/aten/src/ATen/native/Onehot.cpp b/aten/src/ATen/native/Onehot.cpp index 7455e27a1701e..a0c061062174b 100644 --- a/aten/src/ATen/native/Onehot.cpp +++ b/aten/src/ATen/native/Onehot.cpp @@ -23,14 +23,14 @@ Tensor one_hot(const Tensor &self, int64_t num_classes) { } // non-empty 
tensor - if (self.device().type() != at::kCUDA) { + if (self.device().type() != at::kCUDA && self.device().type() != at::kMPS) { //for cuda, rely on device assert thrown by scatter TORCH_CHECK(self.min().item().toLong() >= 0, "Class values must be non-negative."); } if (num_classes == -1) { num_classes = self.max().item().toLong() + 1; } else { - if (self.device().type() != at::kCUDA) { + if (self.device().type() != at::kCUDA && self.device().type() != at::kMPS) { //rely on device asserts from scatter to avoid sync here TORCH_CHECK(num_classes > self.max().item().toLong(), "Class values must be smaller than num_classes."); } else { diff --git a/aten/src/ATen/native/Pool.h b/aten/src/ATen/native/Pool.h index 0f3885524a79a..1106c5db0134f 100644 --- a/aten/src/ATen/native/Pool.h +++ b/aten/src/ATen/native/Pool.h @@ -58,6 +58,11 @@ template static inline T pooling_output_shape( T inputSize, T kernelSize, T pad, T stride, T dilation, bool ceil_mode) { TORCH_CHECK(stride != 0, "stride should not be zero"); + TORCH_CHECK(pad >= 0, "pad must be non-negative, but got pad: ", pad); + TORCH_CHECK(pad <= kernelSize / 2, "pad should be at most half of kernel size, but got pad=", pad, " and kernel_size=", kernelSize) return pooling_output_shape_pad_lr( inputSize, kernelSize, pad, pad, stride, dilation, ceil_mode); } diff --git a/aten/src/ATen/native/README.md b/aten/src/ATen/native/README.md index 043e93e332a69..cfce94a36c0e4 100644 --- a/aten/src/ATen/native/README.md +++ b/aten/src/ATen/native/README.md @@ -476,6 +476,28 @@ as `Tensor &`, which 1) allowed changing which `TensorImpl` the `Tensor` itself was not necessary to allow the underlying data to change. (This was like using `T * const` when we wanted `const T*`.) +### `autogen` + +``` +- func: my_op_(Tensor(a!) self) -> Tensor(a!) +... + autogen: my_op, my_op.out +``` + +The `autogen` keyword is used to specify which native functions the codegen system should generate +implementations for. +* For an in-place variant of a native function (op name ends with an `_`), we will generate a functional +variant and an out= variant. +* If a functional variant is given, we generate an out= variant. +* We don't support `autogen` for view ops, ops that bypass the dispatcher, or composite ops. + +We also generate kernels for generated ops, which merely copy and return the result from the base ops. +These generated kernels can be found in `/aten/src/ATen/CompositeViewCopyKernels.cpp`. + +Also note that new operators added to `native_functions.yaml` that satisfy the requirements +mentioned above should include the `autogen` keyword, since functionalization depends on it. We will +enforce this in codegen.
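To make the "copy and return" behaviour of the generated kernels concrete, here is a rough, hand-written sketch of what an autogen'd out= kernel for the hypothetical `my_op` above boils down to. The real code is emitted by the codegen into the file mentioned above and may differ in details; `my_op_out_sketch` is an illustrative name, and `clone` stands in for calling the base op.

```
// Editorial sketch only: the general shape of a generated out= kernel.
#include <ATen/ATen.h>

at::Tensor& my_op_out_sketch(const at::Tensor& self, at::Tensor& out) {
  at::Tensor result = self.clone();  // stand-in for calling the base op my_op(self)
  out.resize_(result.sizes());       // give the user-supplied tensor the right shape
  out.copy_(result);                 // copy the base op's result into it
  return out;
}
```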
+ ## Writing an implementation in C++ diff --git a/aten/src/ATen/native/RangeFactories.cpp b/aten/src/ATen/native/RangeFactories.cpp index b4eff5ed9e21f..038da93456edb 100644 --- a/aten/src/ATen/native/RangeFactories.cpp +++ b/aten/src/ATen/native/RangeFactories.cpp @@ -142,6 +142,10 @@ Tensor& range_out(const Scalar& start, const Scalar& end, const Scalar& step, Te return result; } +Tensor& range_out_no_step(const Scalar& start, const Scalar& end, Tensor& result) { + return range_out(start, end, /*step = */ 1, result); +} + Tensor& arange_out(const Scalar& start, const Scalar& end, const Scalar& step, Tensor& result) { AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBFloat16, result.scalar_type(), "arange_cpu", [&]() { using accscalar_t = at::acc_type; diff --git a/aten/src/ATen/native/ReduceAllOps.cpp b/aten/src/ATen/native/ReduceAllOps.cpp index 43538c2347627..31764734b67ab 100644 --- a/aten/src/ATen/native/ReduceAllOps.cpp +++ b/aten/src/ATen/native/ReduceAllOps.cpp @@ -1,4 +1,5 @@ #include +#include #include #include @@ -17,6 +18,13 @@ Tensor min(const Tensor &self) { return result; } +Tensor& min_unary_out(const Tensor &self, Tensor& out) { + Tensor tmp_output = at::min(self); + at::native::resize_output(out, tmp_output.sizes()); + out.copy_(tmp_output); + return out; +} + Tensor max(const Tensor &self) { TORCH_CHECK(self.numel() > 0, "max(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument."); @@ -25,6 +33,13 @@ Tensor max(const Tensor &self) { return result; } +Tensor& max_unary_out(const Tensor &self, Tensor& out) { + Tensor tmp_output = at::max(self); + at::native::resize_output(out, tmp_output.sizes()); + out.copy_(tmp_output); + return out; +} + // DEPRECATED: Use at::aminmax instead std::tuple _aminmax_all(const Tensor &self) { TORCH_WARN_ONCE("_aminmax is deprecated as of PyTorch 1.11 and will be removed in a future release. Use aminmax instead." 
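The unary out= reductions added in ReduceAllOps.cpp above follow a simple compute-then-copy pattern. A minimal standalone sketch of the same idea, using plain `resize_` where the patch uses `at::native::resize_output`:

```
// Editorial sketch: the max_unary_out pattern, written as a standalone program.
#include <ATen/ATen.h>
#include <iostream>

int main() {
  at::Tensor x = at::randn({4, 5});
  at::Tensor out = at::empty({0}, x.options());  // destination, resized below
  at::Tensor tmp = at::max(x);                   // full reduction -> 0-dim tensor
  out.resize_(tmp.sizes());                      // match the 0-dim result shape
  out.copy_(tmp);                                // copy into the caller's tensor
  std::cout << "max = " << out << "\n";
  return 0;
}
```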
diff --git a/aten/src/ATen/native/ReduceOps.cpp b/aten/src/ATen/native/ReduceOps.cpp index 52ddcd83774ff..803892a781a4b 100644 --- a/aten/src/ATen/native/ReduceOps.cpp +++ b/aten/src/ATen/native/ReduceOps.cpp @@ -1079,16 +1079,12 @@ Tensor sum(const Tensor& self, DimnameList dim, bool keepdim, c10::optional opt_dtype) { - return at::sum(input_t, c10::asIntArrayRefSlow(dim), keepdim, opt_dtype); -} - Tensor& sum_out(const Tensor& self, DimnameList dim, bool keepdim, optional opt_dtype, Tensor& result) { return at::sum_out(result, self, dimnames_to_positions(self, dim), keepdim, opt_dtype); } -Tensor& nansum_out(const Tensor& self, IntArrayRef dim, +Tensor& nansum_out(const Tensor& self, at::OptionalIntArrayRef dim, bool keepdim, optional opt_dtype, Tensor& result) { TORCH_CHECK(!c10::isComplexType(self.scalar_type()), "nansum does not support complex inputs"); // For integral types, use existing sum as @@ -1107,7 +1103,7 @@ Tensor& nansum_out(const Tensor& self, IntArrayRef dim, return result; } -Tensor nansum(const Tensor& self, IntArrayRef dim, bool keepdim, c10::optional opt_dtype) { +Tensor nansum(const Tensor& self, at::OptionalIntArrayRef dim, bool keepdim, c10::optional opt_dtype) { ScalarType dtype = get_dtype_from_self(self, opt_dtype, true); Tensor result = create_reduction_result(self, dim, keepdim, dtype); return at::native::nansum_out(self, dim, keepdim, dtype, result); @@ -1239,7 +1235,7 @@ Tensor& mean_out(const Tensor& self, DimnameList dim, // TODO(@heitorschueroff) implement custom kernels for nanmean Tensor& nanmean_out( const Tensor& self, - IntArrayRef dim, + at::OptionalIntArrayRef dim, bool keepdim, c10::optional opt_dtype, Tensor& result) { @@ -1254,7 +1250,7 @@ Tensor& nanmean_out( Tensor nanmean( const Tensor& self, - IntArrayRef dim, + at::OptionalIntArrayRef dim, bool keepdim, optional opt_dtype) { TORCH_CHECK( @@ -1603,7 +1599,7 @@ static Tensor& std_var_out( if (at::isComplexType(self.scalar_type())) { // For complex, calculate variance of real and imaginary components - // seperately then add to get overall variance. + // separately then add to get overall variance. ScalarType dtype = c10::toRealValueType(get_dtype_from_result(result, {})); Tensor real_in = at::real(self); Tensor real_out = at::empty({0}, self.options().dtype(dtype)); @@ -1674,7 +1670,7 @@ static std::tuple std_var_mean_out( result1.scalar_type(), " and ", result2.scalar_type(), "."); if (at::isComplexType(self.scalar_type())) { - // For complex, calculate for real and imaginary components seperately then combine as: + // For complex, calculate for real and imaginary components separately then combine as: // variance = var_real + var_imag // mean = mean_real + j * mean_imag ScalarType dtype = c10::toRealValueType(get_dtype_from_result(result1, {})); @@ -1729,25 +1725,31 @@ static std::tuple std_var_mean_out( } std::tuple var_mean( - const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim) { - return at::var_mean(self, /*dim=*/at::OptionalIntArrayRef(dim), - /*correction=*/int64_t{unbiased ? 1 : 0}, keepdim); + const Tensor& self, at::OptionalIntArrayRef dim, bool unbiased, bool keepdim) { + return at::var_mean( + self, /*dim=*/at::OptionalIntArrayRef(dim), + /*correction=*/c10::make_optional({unbiased ? 1 : 0}), + keepdim); } std::tuple std_mean( - const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim) { - return at::std_mean(self, /*dim=*/at::OptionalIntArrayRef(dim), - /*correction=*/int64_t{unbiased ? 
1 : 0}, keepdim); + const Tensor& self, at::OptionalIntArrayRef dim, bool unbiased, bool keepdim) { + return at::std_mean( + self, /*dim=*/at::OptionalIntArrayRef(dim), + /*correction=*/c10::make_optional({unbiased ? 1 : 0}), + keepdim); } std::tuple std_mean(const Tensor& self, bool unbiased) { return at::std_mean( - self, /*dim=*/c10::nullopt, /*correction=*/int64_t{unbiased ? 1 : 0}); + self, /*dim=*/c10::nullopt, + /*correction=*/c10::make_optional({unbiased ? 1 : 0})); } std::tuple var_mean(const Tensor& self, bool unbiased) { return at::var_mean( - self, /*dim=*/c10::nullopt, /*correction=*/int64_t{unbiased ? 1 : 0}); + self, /*dim=*/c10::nullopt, + /*correction=*/c10::make_optional({unbiased ? 1 : 0})); } std::tuple var_mean_out( @@ -1782,32 +1784,37 @@ std::tuple std_mean( Tensor var(const Tensor& self, bool unbiased) { return at::var( - self, /*dim=*/c10::nullopt, /*correction=*/int64_t{unbiased ? 1 : 0}); + self, /*dim=*/c10::nullopt, + /*correction=*/c10::make_optional({unbiased ? 1 : 0})); } -Tensor var(const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim) { - return at::var(self, /*dim=*/at::OptionalIntArrayRef(dim), - /*correction=*/int64_t{unbiased ? 1 : 0}, keepdim); +Tensor var(const Tensor& self, at::OptionalIntArrayRef dim, bool unbiased, bool keepdim) { + return at::var( + self, /*dim=*/at::OptionalIntArrayRef(dim), + /*correction=*/c10::make_optional({unbiased ? 1 : 0}), + keepdim); } -Tensor& var_out(const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim, Tensor& result) { - return at::var_out(result, self, /*dim=*/at::OptionalIntArrayRef(dim), - /*correction=*/int64_t{unbiased ? 1 : 0}, keepdim); +Tensor& var_out(const Tensor& self, at::OptionalIntArrayRef dim, bool unbiased, bool keepdim, Tensor& result) { + return at::var_out( + result, self, /*dim=*/at::OptionalIntArrayRef(dim), + /*correction=*/c10::make_optional({unbiased ? 1 : 0}), + keepdim); } Tensor std(const Tensor& self, bool unbiased) { return at::std( - self, /*dim=*/c10::nullopt, /*correction=*/int64_t{unbiased ? 1 : 0}); + self, /*dim=*/c10::nullopt, /*correction=*/c10::make_optional({unbiased ? 1 : 0})); } -Tensor std(const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim) { - return at::std(self, /*dim=*/at::OptionalIntArrayRef(dim), - /*correction=*/int64_t{unbiased ? 1 : 0}, keepdim); +Tensor std(const Tensor& self, at::OptionalIntArrayRef dim, bool unbiased, bool keepdim) { + return at::std(self, dim, + /*correction=*/c10::make_optional({unbiased ? 1 : 0}), keepdim); } -Tensor& std_out(const Tensor& self, IntArrayRef dim, bool unbiased, bool keepdim, Tensor& result) { - return at::std_out(result, self, /*dim=*/at::OptionalIntArrayRef(dim), - /*correction=*/int64_t{unbiased ? 1 : 0}, keepdim); +Tensor& std_out(const Tensor& self, at::OptionalIntArrayRef opt_dim, bool unbiased, bool keepdim, Tensor& result) { + return at::std_out(result, self, opt_dim, + /*correction=*/c10::make_optional({unbiased ? 1 : 0}), keepdim); } Tensor std(const Tensor& self, at::OptionalIntArrayRef dim, @@ -1966,9 +1973,6 @@ bool cpu_equal(const Tensor& self, const Tensor& other) { at::NoNamesGuard guard; TORCH_CHECK(self.device() == other.device(), "Cannot compare two tensors on " "different devices. 
Got: ", self.device(), " and ", other.device()); - TORCH_CHECK(self.dtype() == other.dtype(), - "Expected object of scalar type ", self.dtype(), " but got scalar type ", - other.dtype(), " for argument 'other'"); if (!self.is_same_size(other)) { return false; } diff --git a/aten/src/ATen/native/ReduceOpsUtils.h b/aten/src/ATen/native/ReduceOpsUtils.h index 7c73c85d4c2ff..9db9802ea788b 100644 --- a/aten/src/ATen/native/ReduceOpsUtils.h +++ b/aten/src/ATen/native/ReduceOpsUtils.h @@ -159,7 +159,7 @@ static void resize_reduction_result( } inline Tensor create_reduction_result( - const Tensor& self, IntArrayRef dim, bool keepdim, ScalarType dtype + const Tensor& self, at::OptionalIntArrayRef dim, bool keepdim, ScalarType dtype ) { DimMask mask = make_dim_mask(dim, self.dim()); auto shape = shape_from_dim_mask(self, mask, keepdim); diff --git a/aten/src/ATen/native/SoftMax.cpp b/aten/src/ATen/native/SoftMax.cpp index 43c7874e43722..21a94d5ed9235 100644 --- a/aten/src/ATen/native/SoftMax.cpp +++ b/aten/src/ATen/native/SoftMax.cpp @@ -131,7 +131,18 @@ void host_softmax( Tensor output, const Tensor& input, const int64_t dim, - bool* mask = nullptr) { + bool* mask = nullptr, + const c10::optional mask_type_ = NULL) { + + if (MaskedSoftMax) { + TORCH_CHECK(mask_type_.has_value(), "Mask Type should be defined"); + int64_t mask_type = mask_type_.value(); + TORCH_CHECK((mask_type == 0) || (mask_type == 1), "Mask Type should be 0 (src_mask) or 1 (src_key_padding_mask)"); + + // TODO: Add support for TxT src_mask + TORCH_CHECK(mask_type != 0, "src_mask not currently supported on CPU"); + } + int64_t outer_size = 1; int64_t dim_size = input.size(dim); int64_t inner_size = 1; @@ -541,7 +552,7 @@ Tensor log_softmax(const Tensor& self, Dimname dim, optional dtype) return at::log_softmax(self, dimname_to_position(self, dim), dtype); } -Tensor masked_softmax_cpu(const Tensor& input_, const Tensor& mask_, const c10::optional dim_) { +Tensor masked_softmax_cpu(const Tensor& input_, const Tensor& mask_, const c10::optional dim_, const c10::optional mask_type_) { TORCH_CHECK( input_.sizes() == mask_.sizes(), "Mask shape should match input shape"); TORCH_CHECK( @@ -564,7 +575,7 @@ Tensor masked_softmax_cpu(const Tensor& input_, const Tensor& mask_, const c10:: scalar_t, false /* LogSoftMax */, true /* MaskedSoftMax */>( - output, input, dim, mask.data_ptr()); + output, input, dim, mask.data_ptr(), mask_type_); }); return output; } diff --git a/aten/src/ATen/native/Sorting.cpp b/aten/src/ATen/native/Sorting.cpp index 18820973fd847..66b9daf7fad8c 100644 --- a/aten/src/ATen/native/Sorting.cpp +++ b/aten/src/ATen/native/Sorting.cpp @@ -226,9 +226,9 @@ Tensor quantile_compute( // NOTE: this check is only performed when running on the CPU to avoid // synchronizing an accelerator with the CPU if (self.device().is_cpu()) { - TORCH_CHECK( - q.ge(0).logical_and_(q.le(1)).all().item(), - "quantile() q values must be in the range [0, 1]"); + auto all_q_in_range = q.ge(0).logical_and_(q.le(1)).all(); + TORCH_CHECK(at::is_scalar_tensor_true(all_q_in_range), + "quantile() q values must be in the range [0, 1]"); } // Flatten input if no dim provided else move dim to reduce as last dimension. 
diff --git a/aten/src/ATen/native/SpectralOps.cpp b/aten/src/ATen/native/SpectralOps.cpp index d6389608a9e36..c2e5bda454ea4 100644 --- a/aten/src/ATen/native/SpectralOps.cpp +++ b/aten/src/ATen/native/SpectralOps.cpp @@ -1,5 +1,6 @@ #include #include +#include #include #include #include @@ -1100,8 +1101,8 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const optional ho y = y.slice(2, start, end, 1); window_envelop = window_envelop.slice(2, start, end, 1); - const auto window_envelop_lowest = window_envelop.abs().min().item().toDouble(); - if (window_envelop_lowest < 1e-11) { + const auto window_envelop_lowest = window_envelop.abs().min().lt(1e-11); + if (at::is_scalar_tensor_true(window_envelop_lowest)) { std::ostringstream ss; REPR(ss) << "window overlap add min: " << window_envelop_lowest; AT_ERROR(ss.str()); @@ -1121,7 +1122,7 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const optional ho } return y; - #undef REPR +#undef REPR } Tensor istft(const Tensor& self, const int64_t n_fft, const optional hop_lengthOpt, diff --git a/aten/src/ATen/native/TensorAdvancedIndexing.cpp b/aten/src/ATen/native/TensorAdvancedIndexing.cpp index 951d9eeb18fa3..647955ad4dfc7 100644 --- a/aten/src/ATen/native/TensorAdvancedIndexing.cpp +++ b/aten/src/ATen/native/TensorAdvancedIndexing.cpp @@ -521,9 +521,9 @@ AdvancedIndex::AdvancedIndex(const Tensor& src, TensorList indices_list) } } - // For CUDA tensors, force all index tensors to have the same striding to - // simplify the CUDA kernel. - if (indices.size() >= 2 && this->src.device().type() == kCUDA) { + // For CUDA/MPS tensors, force all index tensors to have the same striding to + // simplify the CUDA/MPS kernel. + if (indices.size() >= 2 && (this->src.device().type() == kCUDA || this->src.device().type() == kMPS)) { if (!all_strides_match(indices)) { for (auto & indice : indices) { indice = indice.contiguous(); diff --git a/aten/src/ATen/native/TensorConversions.cpp b/aten/src/ATen/native/TensorConversions.cpp index 02ccd133c7ee0..819516f673971 100644 --- a/aten/src/ATen/native/TensorConversions.cpp +++ b/aten/src/ATen/native/TensorConversions.cpp @@ -400,10 +400,10 @@ Tensor sparse_compressed_to_dense( if (self.dim() > 3) { // Flatten batch dims auto n_batch_dim = self.dim() - 2; - crow_indices = crow_indices.flatten(0, n_batch_dim); - col_indices = col_indices.flatten(0, n_batch_dim); - values = values.flatten(0, n_batch_dim); - dense = dense.flatten(0, n_batch_dim); + crow_indices = crow_indices.flatten(0, n_batch_dim - 1); + col_indices = col_indices.flatten(0, n_batch_dim - 1); + values = values.flatten(0, n_batch_dim - 1); + dense = dense.flatten(0, n_batch_dim - 1); } // At this point everything has 3d shape either the batch dim was inserted, @@ -427,8 +427,8 @@ Tensor sparse_compressed_to_dense( dense[batch].index_add_(0, offsets, values[batch]); } - // untile the result, NOTE: The final reshape uses the original self.sizes() - // which will squeeze out the extra batch dim if we put one in + // un-tile the result, NOTE: The final reshape uses the original + // self.sizes() which will squeeze out the extra batch dim if we put one in return dense .unflatten( 1, {self.size(-2) / blocksize[0], self.size(-1) / blocksize[1]}) diff --git a/aten/src/ATen/native/TensorFactories.cpp b/aten/src/ATen/native/TensorFactories.cpp index 7d112b9f415d4..c9cc522e06b83 100644 --- a/aten/src/ATen/native/TensorFactories.cpp +++ b/aten/src/ATen/native/TensorFactories.cpp @@ -101,8 +101,12 @@ Tensor arange( return at::arange_out(result, start, 
end, step); } +Tensor& arange_start_out(const Scalar& start, const Scalar& end, Tensor& result) { + return at::arange_out(result, start, end, /*step=*/1); +} + Tensor& arange_out(const Scalar& end, Tensor& result) { - return at::arange_out(result, /*start=*/0, end); + return at::arange_out(result, /*start=*/0, end, /*step=*/1); } Tensor& arange_out(Tensor& result, const Scalar& start, const Scalar& end) { @@ -1019,12 +1023,12 @@ Tensor tril_indices_cpu( // // 3. sequential RAM + transpose: create an n X 2 Tensor, fill the Tensor // sequentially, and then transpose it. - AT_DISPATCH_ALL_TYPES_AND(kBFloat16, result.scalar_type(), "tril_indices", [&]() -> void { + AT_DISPATCH_INDEX_TYPES(result.scalar_type(), "tril_indices", [&]() -> void { // fill the Tensor with correct values - scalar_t* result_data = result.data_ptr(); + index_t* result_data = result.data_ptr(); int64_t i = 0; - scalar_t r = std::max(0, -offset), c = 0; + index_t r = std::max(0, -offset), c = 0; while (i < tril_size) { result_data[i] = r; result_data[tril_size + i++] = c; @@ -1057,14 +1061,14 @@ Tensor triu_indices_cpu( // create an empty Tensor with correct size auto result = at::native::empty_cpu({2, triu_size}, dtype_opt, layout_opt, device_opt, pin_memory_opt); - AT_DISPATCH_ALL_TYPES_AND(kBFloat16, result.scalar_type(), "triu_indices", [&]() -> void { + AT_DISPATCH_INDEX_TYPES(result.scalar_type(), "triu_indices", [&]() -> void { // fill the Tensor with correct values - scalar_t* result_data = result.data_ptr(); + index_t* result_data = result.data_ptr(); int64_t i = 0; // not typing std::max with scalar_t as it could be an unsigned type // NOTE: no need to check if the returned value of std::max overflows - // scalar_t, as i and triu_size act as a guard. - scalar_t c = std::max(0, offset), r = 0; + // index_t, as i and triu_size act as a guard. 
+ index_t c = std::max(0, offset), r = 0; while (i < triu_size) { result_data[i] = r; result_data[triu_size + i++] = c; @@ -1091,19 +1095,19 @@ Tensor zeros(IntArrayRef size, c10::optional layout, c10::optional device, c10::optional pin_memory) { - // See [Note: hacky wrapper removal for TensorOptions] - TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); - - auto result = at::empty(size, options); - return result.zero_(); + return at::zeros_symint(c10::SymIntArrayRef::fromIntArrayRef(size), dtype, layout, device, pin_memory); } -Tensor zeros_symint(c10::SymIntArrayRef size, +Tensor zeros_symint(SymIntArrayRef size, c10::optional dtype, c10::optional layout, c10::optional device, c10::optional pin_memory) { - return at::zeros(asIntArrayRefSlow(size), dtype, layout, device, pin_memory); + // See [Note: hacky wrapper removal for TensorOptions] + TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); + + auto result = at::empty_symint(size, options); + return result.zero_(); } Tensor _efficientzerotensor(IntArrayRef size, @@ -1143,7 +1147,7 @@ Tensor zeros_like( TORCH_CHECK( !(optional_memory_format.has_value()), "memory format option is only supported by strided tensors"); - auto res = at::empty({0}, options); // to be resized + auto res = at::empty({0}, self.options().merge_in(options)); // to be resized if (self.is_sparse()) { res.sparse_resize_and_clear_( @@ -1186,11 +1190,12 @@ Tensor bartlett_window(int64_t window_length, Tensor bartlett_window( int64_t window_length, bool periodic, - c10::optional dtype, + c10::optional dtype_opt, c10::optional layout, c10::optional device, c10::optional pin_memory) { // See [Note: hacky wrapper removal for TensorOptions] + ScalarType dtype = c10::dtype_or_default(dtype_opt); TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); window_function_checks("bartlett_window", options, window_length); @@ -1224,11 +1229,12 @@ Tensor blackman_window(int64_t window_length, Tensor blackman_window( int64_t window_length, bool periodic, - c10::optional dtype, + c10::optional dtype_opt, c10::optional layout, c10::optional device, c10::optional pin_memory) { // See [Note: hacky wrapper removal for TensorOptions] + ScalarType dtype = c10::dtype_or_default(dtype_opt); TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); window_function_checks("blackman_window", options, window_length); @@ -1294,11 +1300,12 @@ Tensor hamming_window( bool periodic, double alpha, double beta, - c10::optional dtype, + c10::optional dtype_opt, c10::optional layout, c10::optional device, c10::optional pin_memory) { // See [Note: hacky wrapper removal for TensorOptions] + ScalarType dtype = c10::dtype_or_default(dtype_opt); TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); window_function_checks("hamming_window", options, window_length); @@ -1370,11 +1377,12 @@ Tensor kaiser_window( int64_t window_length, bool periodic, double beta, - c10::optional dtype, + c10::optional dtype_opt, c10::optional layout, c10::optional device, c10::optional pin_memory) { // See [Note: hacky wrapper removal for TensorOptions] + ScalarType dtype = c10::dtype_or_default(dtype_opt); TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); window_function_checks("kaiser_window", 
options, window_length); diff --git a/aten/src/ATen/native/TensorShape.cpp b/aten/src/ATen/native/TensorShape.cpp index bbecb346ce3ef..85fb9fb627efb 100644 --- a/aten/src/ATen/native/TensorShape.cpp +++ b/aten/src/ATen/native/TensorShape.cpp @@ -123,11 +123,11 @@ TORCH_PRECOMPUTE_META_FUNC(cat)(ITensorListRef tensors, int64_t dim) { size_t size_at_dim = 0; for (const auto i : c10::irange(materialized.size())) { const Tensor& t = materialized[i]; + all_same_dtype = all_same_dtype && out_dtype == t.scalar_type(); if (!at::native::cat_should_skip_tensor(t)) { at::native::check_cat_shape_except_dim(materialized[valid], t, dim, i); size_at_dim += t.size(dim); all_contiguous = all_contiguous && t.is_contiguous(memory_format); - all_same_dtype = all_same_dtype && out_dtype == t.scalar_type(); all_same_sizes_and_stride = all_same_sizes_and_stride && t.sizes() == materialized[valid].get().sizes() && t.strides() == materialized[valid].get().strides(); @@ -2188,17 +2188,18 @@ std::vector dsplit(const Tensor& self, int64_t split_size) { std::vector split_with_sizes(const Tensor& self, IntArrayRef split_sizes, int64_t dim) { TORCH_CHECK(self.dim() != 0, "split expects at least a 1-dimensional tensor"); - int64_t dim_size = self.size(dim); - int64_t num_splits = split_sizes.size(); - std::vector splits(num_splits); + const int64_t dim_size = self.size(dim); + const int64_t num_splits = split_sizes.size(); int64_t start_idx = 0; + std::vector splits; + splits.reserve(num_splits); for (const auto i : c10::irange(num_splits)) { auto length = split_sizes[i]; TORCH_CHECK(length >= 0, "split_with_sizes expects split_sizes have only non-negative ", "entries, but got split_sizes=", split_sizes); - splits[i] = self.narrow(dim, start_idx, length); + splits.push_back(at::native::slice(self, dim, start_idx, start_idx + length, 1)); start_idx += length; } TORCH_CHECK(start_idx == dim_size, @@ -3249,7 +3250,7 @@ Tensor diagonal_backward(const Tensor & grad, IntArrayRef input_sizes, int64_t o Tensor movedim(const Tensor& self, IntArrayRef src, IntArrayRef dst) { TORCH_CHECK(src.size() == dst.size(), "movedim: Invalid source or destination dims: source (", - src, " dims ) should contain the same number of dims as destination (", dst, " dims)"); + src, " dims) should contain the same number of dims as destination (", dst, " dims)"); size_t self_dim = self.dim(); DimVector normalized_src(src.size()); diff --git a/aten/src/ATen/native/TestOps.cpp b/aten/src/ATen/native/TestOps.cpp index 9a3a5b10cb269..a8c30f5c3ba61 100644 --- a/aten/src/ATen/native/TestOps.cpp +++ b/aten/src/ATen/native/TestOps.cpp @@ -1,6 +1,7 @@ // Copyright 2004-present Facebook. All Rights Reserved. #include +#include #include #include @@ -74,5 +75,33 @@ Tensor _test_warn_in_autograd(const Tensor &self) { return self.clone(); } +// Test registration of per-dispatch-key derivatives in derivatives.yaml. +// See derivatives.yaml for dummy registrations. 
+ +Tensor _test_autograd_multiple_dispatch_fullcoverage(const Tensor &self) { + return self.clone(); +} + +Tensor _test_autograd_multiple_dispatch_ntonly(const Tensor &self, bool b) { + return self.clone(); +} + +// Test derivative dispatch registration for view_copy ops +Tensor _test_autograd_multiple_dispatch_view(const Tensor &self) { + return self.view(-1); +} + } // namespace native + +namespace functionalization { + +// view_copy ops must have a functional inverse registered +Tensor FunctionalInverses::_test_autograd_multiple_dispatch_view_copy_inverse(const at::Tensor& base, const at::Tensor& mutated_view, bool reapply_views) { + TORCH_INTERNAL_ASSERT(false, + "Attempted to call _test_autograd_multiple_dispatch_view_copy_inverse() during the functionalization pass. ", + "This function is for testing only and should never be called."); + return Tensor(); +} + +} // namespace functionalization } // namespace at diff --git a/aten/src/ATen/native/UpSample.h b/aten/src/ATen/native/UpSample.h index 6b248352de6ad..f3dd836444d13 100644 --- a/aten/src/ATen/native/UpSample.h +++ b/aten/src/ATen/native/UpSample.h @@ -2,11 +2,11 @@ #include -#include +#include #include +#include #include - /** * Note [compute_scales_value] * Note [area_pixel_compute_scale] @@ -288,7 +288,8 @@ static inline scalar_t area_pixel_compute_source_index( if (align_corners) { return scale * dst_index; } else { - scalar_t src_idx = scale * (dst_index + 0.5) - 0.5; + scalar_t src_idx = scale * (dst_index + static_cast(0.5)) - + static_cast(0.5); // [Note] Follow Opencv resize logic: // We allow negative src_idx here and later will use // dx = src_idx - floorf(src_idx) @@ -301,7 +302,8 @@ static inline scalar_t area_pixel_compute_source_index( // where we should and then remove this cubic flag. // This matters in cubic mode, as we might need [-1, 0, 1, 2] // to interpolate and the weights can be affected. - return (!cubic && src_idx < 0) ? scalar_t(0) : src_idx; + return (!cubic && src_idx < static_cast(0)) ? scalar_t(0) + : src_idx; } } @@ -445,8 +447,10 @@ static inline void compute_source_index_and_lambda( lambda0 = static_cast(1); lambda1 = static_cast(0); } else { - const scalar_t real_input_index = area_pixel_compute_source_index( - ratio, output_index, align_corners, /*cubic=*/false); + using accscalar_t = at::acc_type; + const accscalar_t real_input_index = + area_pixel_compute_source_index( + ratio, output_index, align_corners, /*cubic=*/false); input_index0 = static_cast(real_input_index); int64_t offset = (input_index0 < input_size - 1) ? 1 : 0; input_index1 = input_index0 + offset; diff --git a/aten/src/ATen/native/ao_sparse/quantized/cpu/qnnpack_utils.h b/aten/src/ATen/native/ao_sparse/quantized/cpu/qnnpack_utils.h index 098b862297fd5..6ac89681899c5 100644 --- a/aten/src/ATen/native/ao_sparse/quantized/cpu/qnnpack_utils.h +++ b/aten/src/ATen/native/ao_sparse/quantized/cpu/qnnpack_utils.h @@ -19,7 +19,7 @@ struct TORCH_API PackedLinearWeightQnnp PackedLinearWeightQnnp(const at::Tensor& weight, const c10::optional& bias, const int64_t out_features_block_size /* block sparsity size across output_features */, const int64_t in_features_block_size /* block sparsity size across input_features */); explicit PackedLinearWeightQnnp(const BCSRSerializationType& serialized); c10::optional orig_bias_; - // Seperate copy of bias exist so that we can fill in zeros when + // Separate copy of bias exist so that we can fill in zeros when // optional bias does not exist. 
This is to compy with qnnpack operator that // expects bias to be present. // In case bias is present bias_ is just a reference to orig_bias_ diff --git a/aten/src/ATen/native/cpu/CopyKernel.cpp b/aten/src/ATen/native/cpu/CopyKernel.cpp index de1841d989c3b..27df65c7b0485 100644 --- a/aten/src/ATen/native/cpu/CopyKernel.cpp +++ b/aten/src/ATen/native/cpu/CopyKernel.cpp @@ -13,9 +13,6 @@ namespace native { inline namespace CPU_CAPABILITY { void neg_kernel(TensorIteratorBase &iter); void conj_kernel(TensorIteratorBase &iter); -} // namespace CPU_CAPABILITY - -namespace { void float_bfloat16_copy_kernel(TensorIteratorBase &iter, bool requires_neg) { auto strides_out = iter.strides(0); @@ -246,22 +243,20 @@ void copy_kernel(TensorIterator& iter, bool /*non_blocking*/) { AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4(ScalarType::ComplexHalf, ScalarType::Half, ScalarType::Bool, ScalarType::BFloat16, dtype, "copy_", [&] { using dest_t = scalar_t; AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4(ScalarType::ComplexHalf, ScalarType::Half, ScalarType::Bool, ScalarType::BFloat16, iter.dtype(1), "copy_", [&] { - // Note (@zasdfgbnm): - // - // The code below can not be simplified as - // cpu_kernel(iter, c10::static_cast_with_inter_type::apply); - // - // because this would force the compiler to instantiate the inline function and generate a function call in the loop - // instead of inlining it, making all the optimizations like vectorization impossible. - // You can verify this by looking the the symbols of `libtorch_cpu.so`: - // - // readelf -Ws libtorch_cpu.so | grep static_cast_with_inter_type - // - // If done correctly, the above command should have no output. - // - // See: https://github.com/pytorch/pytorch/issues/31271 - cpu_kernel(iter, [](scalar_t src) -> dest_t { - return c10::static_cast_with_inter_type::apply(src); }); + if (iter.has_contiguous_first_dim()) { + TORCH_INTERNAL_ASSERT(iter.ninputs() == 1); + TORCH_INTERNAL_ASSERT(iter.noutputs() == 1); + + iter.for_each([](char **data, const int64_t *strides, int64_t size) { + auto src = reinterpret_cast(data[1]); + auto dst = reinterpret_cast(data[0]); + at::vec::convert(src, dst, size); + }); + } else { + cpu_kernel(iter, [](scalar_t x) -> dest_t { + return c10::convert(x); + }); + } }); }); @@ -274,7 +269,7 @@ void copy_kernel(TensorIterator& iter, bool /*non_blocking*/) { } } -} // anonymous namespace +} // namespace CPU_CAPABILITY REGISTER_DISPATCH(copy_stub, ©_kernel); diff --git a/aten/src/ATen/native/cpu/CopyKernel.h b/aten/src/ATen/native/cpu/CopyKernel.h new file mode 100644 index 0000000000000..9d2affd6101ab --- /dev/null +++ b/aten/src/ATen/native/cpu/CopyKernel.h @@ -0,0 +1,12 @@ +#pragma once + +namespace at { +struct TensorIteratorBase; + +namespace native { +inline namespace CPU_CAPABILITY { + +void direct_copy_kernel(TensorIteratorBase &iter); +void copy_kernel(TensorIterator& iter, bool /*non_blocking*/); + +}}} // namespace at::native::CPU_CAPABILITY diff --git a/aten/src/ATen/native/cpu/Loops.h b/aten/src/ATen/native/cpu/Loops.h index 4f64a64b51a4c..2558736ddc0fc 100644 --- a/aten/src/ATen/native/cpu/Loops.h +++ b/aten/src/ATen/native/cpu/Loops.h @@ -36,11 +36,6 @@ #include #include -#ifndef _MSC_VER -#pragma GCC diagnostic push -#pragma GCC diagnostic ignored "-Wunused-but-set-parameter" -#endif - namespace at { namespace native { inline namespace CPU_CAPABILITY { using namespace vec; @@ -398,7 +393,3 @@ void cpu_serial_kernel_vec(TensorIteratorBase& iter, func_t&& op, vec_func_t&& v } }}} // namespace at::native:: - -#ifndef _MSC_VER 
-#pragma GCC diagnostic pop -#endif diff --git a/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp b/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp index a53587e56da4b..8a0534fd3da5f 100644 --- a/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp +++ b/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -203,13 +204,18 @@ static void angle_kernel(TensorIteratorBase& iter) { // NB: Ignores the negative bit on tensors void conj_kernel(TensorIteratorBase& iter) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( - kBool, kBFloat16, kHalf, kComplexHalf, iter.common_dtype(), "conj_cpu", [&]() { - cpu_kernel_vec( - iter, - [=](scalar_t a) -> scalar_t { return conj_impl(a); }, - [=](Vectorized a) { return a.conj(); }); - }); + AT_DISPATCH_SWITCH(iter.common_dtype(), "conj_cpu", + AT_DISPATCH_CASE_ALL_TYPES_AND3(kBool, kBFloat16, kHalf, [&] { + // conj is a no-op for non-complex types + direct_copy_kernel(iter); + }) + AT_DISPATCH_CASE_COMPLEX_TYPES_AND(kComplexHalf, [&] { + cpu_kernel_vec( + iter, + [=](scalar_t a) -> scalar_t { return conj_impl(a); }, + [=](Vectorized a) { return a.conj(); }); + }) + ); } static void bitwise_not_kernel(TensorIteratorBase& iter) { diff --git a/aten/src/ATen/native/cpu/UpSampleKernel.cpp b/aten/src/ATen/native/cpu/UpSampleKernel.cpp index cfc9318623725..1c08fee0acac7 100644 --- a/aten/src/ATen/native/cpu/UpSampleKernel.cpp +++ b/aten/src/ATen/native/cpu/UpSampleKernel.cpp @@ -767,7 +767,6 @@ struct HelperInterpNearest : public HelperInterpBase { AT_DISPATCH_FLOATING_TYPES_AND( ScalarType::BFloat16, scalar_type, "compute_indices_weights_nearest", [&] { - scalar_t scale = area_pixel_compute_scale(input_size, output_size, align_corners, opt_scale); auto input_index_ptr = output[0].data_ptr(); @@ -778,10 +777,11 @@ struct HelperInterpNearest : public HelperInterpBase { // index_f32 = (output_index) * scale // input_index = floor(index_f32) // Same as OpenCV INTER_NEAREST - + using accscalar_t = at::acc_type; for (const auto i : c10::irange(output_size)) { - const scalar_t real_input_index = area_pixel_compute_source_index( - scale, i, /*align_corners=*/true, /*cubic=*/false); + const accscalar_t real_input_index = + area_pixel_compute_source_index( + scale, i, /*align_corners=*/true, /*cubic=*/false); input_index = static_cast(floorf(real_input_index)); input_index_ptr[i] = static_cast(std::min(input_index, input_size - 1)) * stride; } @@ -818,7 +818,6 @@ struct HelperInterpNearestExact : public HelperInterpNearest { AT_DISPATCH_FLOATING_TYPES( scalar_type, "compute_indices_weights_nearest", [&] { - scalar_t scale = area_pixel_compute_scale(input_size, output_size, align_corners, opt_scale); auto input_index_ptr = output[0].data_ptr(); @@ -829,10 +828,11 @@ struct HelperInterpNearestExact : public HelperInterpNearest { // index_f32 = (output_index + 0.5) * scale - 0.5 // input_index = round(index_f32) // Same as Pillow and Scikit-Image/Scipy ndi.zoom - + using accscalar_t = at::acc_type; for (const auto i : c10::irange(output_size)) { - const scalar_t real_input_index = area_pixel_compute_source_index( - scale, i, /*align_corners=*/align_corners, /*cubic=*/false); + const accscalar_t real_input_index = + area_pixel_compute_source_index( + scale, i, /*align_corners=*/align_corners, /*cubic=*/false); input_index = static_cast(floorf(real_input_index + 0.5)); input_index_ptr[i] = static_cast(std::min(input_index, input_size - 1)) * stride; } @@ -865,10 +865,8 @@ struct HelperInterpLinear : public HelperInterpBase { 
std::vector output; HelperInterpLinear::init_indices_weights( scalar_type, output, output_size, ndims, reshape_dim, HelperInterpLinear::interp_size); - AT_DISPATCH_FLOATING_TYPES_AND( ScalarType::BFloat16, scalar_type, "compute_indices_weights_linear", [&] { - scalar_t scale = area_pixel_compute_scale(input_size, output_size, align_corners, opt_scale); auto input_index0_ptr = output[0].data_ptr(); @@ -970,7 +968,6 @@ struct HelperInterpCubic : public HelperInterpBase { AT_DISPATCH_FLOATING_TYPES_AND( ScalarType::BFloat16, scalar_type, "compute_indices_weights_cubic", [&] { - scalar_t scale = area_pixel_compute_scale(input_size, output_size, align_corners, opt_scale); int64_t input_index; @@ -980,11 +977,11 @@ struct HelperInterpCubic : public HelperInterpBase { int64_t * idx_ptr; scalar_t * wt_ptr; - + using accscalar_t = at::acc_type; for (const auto i : c10::irange(output_size)) { - - const scalar_t real_input_index = area_pixel_compute_source_index( - scale, i, align_corners, /*cubic=*/true); + const accscalar_t real_input_index = + area_pixel_compute_source_index( + scale, i, align_corners, /*cubic=*/true); input_index = static_cast(floorf(real_input_index)); get_cubic_upsample_coefficients(coeffs, real_input_index - input_index); @@ -1184,7 +1181,6 @@ void _separable_upsample_generic_Nd_kernel_impl_single_dim( int interp_size = F::interp_size; auto input_scalar_type = input.scalar_type(); - if (interp_size == 1 && input_scalar_type == at::ScalarType::Byte) { // nearest also supports uint8 tensor, but we have to use float // with compute_indices_weights diff --git a/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu b/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu index 55b0d3322e04b..2f9e57ed121ee 100644 --- a/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu +++ b/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu @@ -442,10 +442,14 @@ namespace { output_arg{ output, "output", 2 }; checkAllSameGPU(__func__, {input_arg, output_arg}); - for (int64_t i = 1; i < input.ndimension(); i++) { + TORCH_CHECK(output_size.size() == 2, "adaptive_avg_pool2d: output_size must be 2"); + int64_t ndim = input.dim(); + TORCH_CHECK((ndim == 3 || ndim == 4), + "adaptive_avg_pool2d(): Expected 3D or 4D tensor, but got ", input.sizes()); + for (const auto i : {-2, -1}) { TORCH_CHECK(input.size(i) > 0, "adaptive_avg_pool2d(): Expected input to have non-zero size for non-batch dimensions, " - "but input has sizes ", input.sizes(), " with dimension ", i, " being " + "but input has sizes ", input.sizes(), " with dimension ", i + ndim, " being " "empty"); } @@ -538,9 +542,6 @@ namespace { break; } case at::MemoryFormat::Contiguous: { - TORCH_CHECK((input.ndimension() == 3 || input.ndimension() == 4), - "adaptive_avg_pool2d(): Expected 3D or 4D tensor, but got ", - input.sizes()); int64_t grid_x = input.size(-3); if (input.ndimension() == 4) { input_ = input.contiguous(); diff --git a/aten/src/ATen/native/cuda/Blas.cpp b/aten/src/ATen/native/cuda/Blas.cpp index 3ca9814175c59..3ff971fdfaff8 100644 --- a/aten/src/ATen/native/cuda/Blas.cpp +++ b/aten/src/ATen/native/cuda/Blas.cpp @@ -171,8 +171,8 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma scalar_type == at::ScalarType::Half || scalar_type == at::ScalarType::BFloat16) && mat2_sizes[0] > 1 && mat2_sizes[1] > 1 && - mat2_sizes[0] < 65535 && mat2_sizes[1] < 65535 && - mat1_sizes[0] < 65535 && mat1_sizes[1] < 65535 && + mat2_sizes[0] < 65535*32 && mat2_sizes[1] < 65535*32 && + mat1_sizes[0] < 65535*32 && mat1_sizes[1] < 
65535*32 && // avoid leaing dim >> rows bugs ((mat1.strides()[0]==1 && mat1.strides()[1]==mat1_sizes[0]) || (mat1.strides()[1] == 1 && mat1.strides()[0] == mat1_sizes[1]) || (scalar_type != at::ScalarType::Half && scalar_type != at::ScalarType::BFloat16)) && ((mat2.strides()[0]==1 && mat2.strides()[1]==mat2_sizes[0]) || (mat2.strides()[1] == 1 && mat2.strides()[0] == mat2_sizes[1]) || (scalar_type != at::ScalarType::Half && scalar_type != at::ScalarType::BFloat16)); diff --git a/aten/src/ATen/native/cuda/Copy.cu b/aten/src/ATen/native/cuda/Copy.cu index 4fb647e329d3c..9ed3c5a8bbcc4 100644 --- a/aten/src/ATen/native/cuda/Copy.cu +++ b/aten/src/ATen/native/cuda/Copy.cu @@ -23,7 +23,6 @@ namespace native { void neg_kernel_cuda(TensorIteratorBase &iter); void conj_kernel_cuda(TensorIteratorBase &iter); -namespace { void direct_copy_kernel_cuda(TensorIteratorBase &iter) { ScalarType dtype = iter.dtype(0); if (isQIntType(dtype)) { @@ -43,7 +42,6 @@ void neg_conj_kernel_cuda(TensorIteratorBase &iter) { gpu_kernel(iter, [] GPU_LAMBDA(scalar_t x) { return -std::conj(x); }); }); } -} // namespace (anonymous) using namespace at::cuda; diff --git a/aten/src/ATen/native/cuda/Copy.h b/aten/src/ATen/native/cuda/Copy.h new file mode 100644 index 0000000000000..5639567d66668 --- /dev/null +++ b/aten/src/ATen/native/cuda/Copy.h @@ -0,0 +1,10 @@ +#pragma once + +namespace at { +struct TensorIteratorBase; + +namespace native { + +void direct_copy_kernel_cuda(TensorIteratorBase &iter); + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/CumminmaxKernel.cu b/aten/src/ATen/native/cuda/CumminmaxKernel.cu new file mode 100644 index 0000000000000..ea73273e2d4b7 --- /dev/null +++ b/aten/src/ATen/native/cuda/CumminmaxKernel.cu @@ -0,0 +1,29 @@ +#define TORCH_ASSERT_NO_OPERATORS +#include +#include + +#include +#include + +#include +#include + +namespace at { namespace native { + +void launch_cummax_cuda_kernel(const TensorBase& self, const TensorBase& values, const TensorBase& indices, int64_t dim) { + AT_DISPATCH_ALL_TYPES_AND3(at::ScalarType::Bool, at::ScalarType::Half, at::ScalarType::BFloat16, + self.scalar_type(), "cummax_cuda", [&]() { + scalar_t init = self.is_floating_point() ? (-1*std::numeric_limits::infinity()) : std::numeric_limits::lowest(); + scan_dim_with_indices(self, values, indices, dim, init, std::greater_equal()); + }); +} + +void launch_cummin_cuda_kernel(const TensorBase& self, const TensorBase& values, const TensorBase& indices, int64_t dim) { + AT_DISPATCH_ALL_TYPES_AND3(at::ScalarType::Bool, at::ScalarType::Half, at::ScalarType::BFloat16, + self.scalar_type(), "cummin_cuda", [&]() { + scalar_t init = self.is_floating_point() ? 
std::numeric_limits::infinity() : std::numeric_limits::max(); + scan_dim_with_indices(self, values, indices, dim, init, std::less_equal()); + }); +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/CumprodKernel.cu b/aten/src/ATen/native/cuda/CumprodKernel.cu new file mode 100644 index 0000000000000..d1f3233abb130 --- /dev/null +++ b/aten/src/ATen/native/cuda/CumprodKernel.cu @@ -0,0 +1,23 @@ +#define TORCH_ASSERT_NO_OPERATORS +#include +#include + +#include +#include + +namespace at { namespace native { + +void launch_cumprod_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( + ScalarType::Half, ScalarType::BFloat16, self.scalar_type(), "cumprod_cuda", [&]() { + scalar_t init = 1; + scan_dim( + self, + result, + dim, + init, + std::multiplies()); + }); +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/CumsumKernel.cu b/aten/src/ATen/native/cuda/CumsumKernel.cu new file mode 100644 index 0000000000000..85866b3f0f325 --- /dev/null +++ b/aten/src/ATen/native/cuda/CumsumKernel.cu @@ -0,0 +1,25 @@ +#define TORCH_ASSERT_NO_OPERATORS +#include +#include + +#include +#include + +namespace at { namespace native { + +void launch_cumsum_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( + ScalarType::Half, ScalarType::BFloat16, + self.scalar_type(), "cumsum_cuda", + [&]() { + scalar_t init = 0; + scan_dim( + self, + result, + dim, + init, + std::plus()); + }); +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/DistanceKernel.cu b/aten/src/ATen/native/cuda/DistanceKernel.cu index a9130bd3e8083..2ae4cd592e6bc 100644 --- a/aten/src/ATen/native/cuda/DistanceKernel.cu +++ b/aten/src/ATen/native/cuda/DistanceKernel.cu @@ -6,6 +6,8 @@ #include #include +#include +#include #include #ifndef AT_PER_OPERATOR_HEADERS @@ -21,20 +23,7 @@ namespace at { namespace native { namespace { -static const int forward_threads = 256; - -template -static __forceinline__ __device__ scalar_t device_sqrt(scalar_t val); - -template <> -__forceinline__ __device__ float device_sqrt(float val) { - return ::sqrtf(val); -} - -template <> -__forceinline__ __device__ double device_sqrt(double val) { - return ::sqrt(val); -} +constexpr int kCUDANumThreads = 256; template struct dists { @@ -92,27 +81,16 @@ struct dists { }; template -__device__ static inline scalar_t reduce_agg(scalar_t agg) { - for (int offset = warpSize / 2; offset > 0; offset /= 2) { - F::agg(agg, WARP_SHFL_DOWN(agg, offset)); - } - - __shared__ scalar_t shared[forward_threads]; - int lane = threadIdx.x % warpSize; - int warp_id = threadIdx.x / warpSize; - if (lane == 0) { - shared[warp_id] = agg; - } +struct DistReduceOp { + __forceinline__ __device__ scalar_t combine(scalar_t a, scalar_t b) const { + F::agg(a, b); + return a; + } - __syncthreads(); - agg = (threadIdx.x < blockDim.x / warpSize) ? 
shared[lane] : 0.0; - if (warp_id == 0) { - for (int offset = blockDim.x / warpSize / 2; offset > 0; offset /= 2) { - F::agg(agg, WARP_SHFL_DOWN(agg, offset)); + __forceinline__ __device__ scalar_t warp_shfl_down(scalar_t data, int offset) const { + return WARP_SHFL_DOWN(data, offset); } - } - return agg; -} +}; template __global__ static void pdist_kernel_cuda_impl(scalar_t * result, const scalar_t * self, const int64_t n, const int64_t m, const scalar_t p, @@ -133,7 +111,9 @@ __global__ static void pdist_kernel_cuda_impl(scalar_t * result, const scalar_t F::inc(agg, std::abs(*a - *b), p); } - agg = reduce_agg(agg); + __shared__ scalar_t agg_smem[kCUDANumThreads]; + scalar_t agg_init{0.0}; + agg = cuda_utils::BlockReduce(agg, DistReduceOp{}, agg_init, agg_smem); if (threadIdx.x == 0) { result[k] = F::finish(agg, p); } @@ -222,7 +202,9 @@ __global__ static void cdist_kernel_cuda_impl(scalar_t * result, const scalar_t for (; a < end; a += stride, b += stride) { F::inc(agg, std::abs(*a - *b), p); } - agg = reduce_agg(agg); + __shared__ scalar_t agg_smem[kCUDANumThreads]; + scalar_t agg_init{0.0}; + agg = cuda_utils::BlockReduce(agg, DistReduceOp{}, agg_init, agg_smem); if (threadIdx.x == 0) { result[blockIdx.x] = F::finish(agg, p); } @@ -236,31 +218,27 @@ void cdist_kernel_impl(Tensor& result, const Tensor& x1, const Tensor& x2, doubl const int64_t l1_size = r1 * m; const int64_t l2_size = r2 * m; const dim3 grid(result.numel()); - const dim3 block(forward_threads); + const dim3 block(kCUDANumThreads); AT_DISPATCH_FLOATING_TYPES(x1.scalar_type(), "cdist_cuda", [&] { + auto impl_fptr = cdist_kernel_cuda_impl::p>; if (p == 0.0) { - cdist_kernel_cuda_impl::zero><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_kernel_cuda_impl::zero>; } else if (p == 1.0) { - cdist_kernel_cuda_impl::one><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_kernel_cuda_impl::one>; } else if (p == 2.0) { - cdist_kernel_cuda_impl::two><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_kernel_cuda_impl::two>; } else if (std::isinf(p)) { - cdist_kernel_cuda_impl::inf><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } else { - cdist_kernel_cuda_impl::p><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_kernel_cuda_impl::inf>; } + impl_fptr<<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); + C10_CUDA_KERNEL_LAUNCH_CHECK(); }); } void pdist_forward_kernel_impl(Tensor& result, const Tensor& self, double p) { const dim3 grid(result.numel()); - const dim3 block(forward_threads); + const dim3 block(kCUDANumThreads); int64_t n = self.size(0); int64_t m = self.size(1); // https://github.com/pytorch/pytorch/issues/15511 demonstrated we need to do @@ -269,22 +247,18 @@ void pdist_forward_kernel_impl(Tensor& result, const Tensor& self, double p) { const double n2_squared_minus_1 = n2 * n2 - 1; AT_DISPATCH_FLOATING_TYPES(self.scalar_type(), "pdist_cuda", [&] { + auto impl_fptr = pdist_kernel_cuda_impl::p>; if (p == 0.0) { - pdist_kernel_cuda_impl::zero><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - 
C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_kernel_cuda_impl::zero>; } else if (p == 1.0) { - pdist_kernel_cuda_impl::one><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_kernel_cuda_impl::one>; } else if (p == 2.0) { - pdist_kernel_cuda_impl::two><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_kernel_cuda_impl::two>; } else if (std::isinf(p)) { - pdist_kernel_cuda_impl::inf><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } else { - pdist_kernel_cuda_impl::p><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_kernel_cuda_impl::inf>; } + impl_fptr<<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); + C10_CUDA_KERNEL_LAUNCH_CHECK(); }); } @@ -311,22 +285,18 @@ void pdist_backward_kernel_impl(Tensor& result, const Tensor& grad, const Tensor Tensor buffer = at::empty({n - 1, result.size(0), result.size(1)}, result.options()); AT_DISPATCH_FLOATING_TYPES(self.scalar_type(), "pdist_cuda_backward", [&] { + auto impl_fptr = pdist_backward_kernel_cuda_impl::p>; if (p == 1.0) { - pdist_backward_kernel_cuda_impl::one><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_backward_kernel_cuda_impl::one>; } else if (p < 2.0) { - pdist_backward_kernel_cuda_impl::lt_two><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_backward_kernel_cuda_impl::lt_two>; } else if (p == 2.0) { - pdist_backward_kernel_cuda_impl::two><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_backward_kernel_cuda_impl::two>; } else if (std::isinf(p)) { - pdist_backward_kernel_cuda_impl::inf><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } else { - pdist_backward_kernel_cuda_impl::p><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_backward_kernel_cuda_impl::inf>; } + impl_fptr<<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); + C10_CUDA_KERNEL_LAUNCH_CHECK(); }); at::sum_out(result, buffer, 0); @@ -364,32 +334,20 @@ void cdist_backward_kernel_impl(Tensor& result, const Tensor& grad, const Tensor Tensor buffer = at::empty({batch, r2, r1, m}, result.options()); AT_DISPATCH_FLOATING_TYPES(result.scalar_type(), "cdist_cuda_backward", [&] { + auto impl_fptr = cdist_backward_kernel_cuda_impl::p>; if (p == 1.0) { - cdist_backward_kernel_cuda_impl::one><<>>(buffer.data_ptr(), - grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), - p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_backward_kernel_cuda_impl::one>; } else if (p < 2.0) { - 
cdist_backward_kernel_cuda_impl::lt_two><<>>(buffer.data_ptr(), - grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), - p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_backward_kernel_cuda_impl::lt_two>; } else if (p == 2.0) { - cdist_backward_kernel_cuda_impl::two><<>>(buffer.data_ptr(), - grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), - p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_backward_kernel_cuda_impl::two>; } else if (std::isinf(p)) { - cdist_backward_kernel_cuda_impl::inf><<>>(buffer.data_ptr(), - grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), - p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } else { - cdist_backward_kernel_cuda_impl::p><<>>(buffer.data_ptr(), + impl_fptr = cdist_backward_kernel_cuda_impl::inf>; + } + impl_fptr<<>>(buffer.data_ptr(), grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } + C10_CUDA_KERNEL_LAUNCH_CHECK(); }); at::sum_out(result, buffer, 1); diff --git a/aten/src/ATen/native/cuda/EmbeddingBag.cu b/aten/src/ATen/native/cuda/EmbeddingBag.cu index 7ac3a7151b79c..2cd76cbe34d1a 100644 --- a/aten/src/ATen/native/cuda/EmbeddingBag.cu +++ b/aten/src/ATen/native/cuda/EmbeddingBag.cu @@ -26,6 +26,7 @@ #include #include #include +#include #include @@ -457,14 +458,6 @@ Tensor _embedding_bag_dense_backward_cuda(const Tensor &grad_, const Tensor &ind } } -template -__inline__ __device__ -static scalar_t warpReduceSum(scalar_t val) { - for (int offset = C10_WARP_SIZE/2; offset > 0; offset /= 2) - val += WARP_SHFL_DOWN(val, offset); - return val; -} - template __global__ static void _embedding_bag_per_sample_weights_backward_kernel( const scalar_t* grad, int64_t grad_stride0, int64_t grad_stride1, @@ -495,7 +488,7 @@ __global__ static void _embedding_bag_per_sample_weights_backward_kernel( weight[weight_stride0 * embedding_idx + weight_stride1 * feature_idx]; } } - result = warpReduceSum(result); + result = cuda_utils::WarpReduceSum(result); if (thread_in_warp == 0) { output[sample_idx] = result; } diff --git a/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu b/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu index 95d0854bd3dec..3b04b68b0f391 100644 --- a/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu +++ b/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu @@ -188,7 +188,7 @@ std::vector foreach_tensor_##NAME##_cuda(TensorList tensors1, TensorList tensor_lists.emplace_back(std::move(vec_res)); \ \ AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBFloat16, tensors1[0].scalar_type(), "foreach_maximum_minimum_op_cuda", [&]() { \ - using opmath_t = at::opmath_type; \ + using opmath_t = at::opmath_type; \ auto op = [] GPU_LAMBDA (opmath_t a, opmath_t b) -> opmath_t { \ opmath_t c = a OP b ? 
a : b; \ if (_isnan(a)) { \ @@ -196,12 +196,37 @@ std::vector foreach_tensor_##NAME##_cuda(TensorList tensors1, TensorList } \ return c;}; \ multi_tensor_apply<3>(tensor_lists, \ - PointwiseOpListFunctor(), \ - op); \ + BinaryOpListAlphaFunctor(), \ + op, \ + opmath_t(1)); \ }); \ \ return tensor_lists[2]; \ } \ + \ +void foreach_tensor_##NAME##_cuda_(TensorList self, TensorList other) { \ + check_foreach_api_restrictions(self, other); \ + if (!can_use_fast_route({self, other}) || has_bool_tensor(self)) { \ + return at::native::foreach_tensor_##NAME##_slow_(self, other); \ + } \ + \ + AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBFloat16, self[0].scalar_type(), "foreach_maximum_minimum_op_cuda_", \ + [&]() { \ + using opmath_t = at::opmath_type; \ + std::vector> tensor_lists{self.vec(), other.vec()}; \ + auto op = [] GPU_LAMBDA (opmath_t a, opmath_t b) -> opmath_t { \ + opmath_t c = a OP b ? a : b; \ + if (_isnan(a)) { \ + c = a; \ + } \ + return c; \ + }; \ + multi_tensor_apply<2>(tensor_lists, \ + BinaryOpListAlphaFunctor(), \ + op, \ + opmath_t(1)); \ + }); \ +} \ FOREACH_MAXIMUM_MINIMUM_OP(maximum, >) FOREACH_MAXIMUM_MINIMUM_OP(minimum, <) diff --git a/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu b/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu index 46ea4eadf1feb..24db8776cd49a 100644 --- a/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu +++ b/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu @@ -185,10 +185,10 @@ TORCH_IMPL_FUNC(fractional_max_pool2d_out_cuda) ( AT_DISPATCH_FLOATING_TYPES_AND_HALF(input.scalar_type(), "fractional_max_pool2d_out_cuda_frame", [&] { - auto devInput = input_.packed_accessor(); - auto devOutput = output_.packed_accessor(); - auto devIndices = indices_.packed_accessor(); - auto devSamples = randomSamples.packed_accessor(); + auto devInput = input_.packed_accessor64(); + auto devOutput = output_.packed_accessor64(); + auto devIndices = indices_.packed_accessor64(); + auto devSamples = randomSamples.packed_accessor64(); fractional_max_pool2d_out_cuda_frame <<>>( devOutput, devIndices, devInput, devSamples, @@ -253,12 +253,12 @@ TORCH_IMPL_FUNC(fractional_max_pool2d_backward_cuda)( gradInput_.size(0)); dim3 block(outputPlaneSize > 128 ? 128 : outputPlaneSize); - auto devIndices = indices_.packed_accessor(); + auto devIndices = indices_.packed_accessor64(); AT_DISPATCH_FLOATING_TYPES_AND_HALF(gradOutput.scalar_type(), "fractional_max_pool2d_backward_out_cuda_frame", [&] { - auto devGradInput = gradInput_.packed_accessor(); - auto devGradOutput = gradOutput_.packed_accessor(); + auto devGradInput = gradInput_.packed_accessor64(); + auto devGradOutput = gradOutput_.packed_accessor64(); fractional_max_pool2d_backward_out_cuda_frame <<>>( devGradInput, devGradOutput, devIndices); diff --git a/aten/src/ATen/native/cuda/Indexing.cu b/aten/src/ATen/native/cuda/Indexing.cu index 4720da4bd1124..6ea88069ca2ef 100644 --- a/aten/src/ATen/native/cuda/Indexing.cu +++ b/aten/src/ATen/native/cuda/Indexing.cu @@ -1256,6 +1256,11 @@ Tensor & masked_fill__cuda(Tensor& self, const Tensor & mask, const Scalar& valu Tensor & masked_fill__cuda(Tensor& self, const Tensor & mask, const Tensor & value) { TORCH_CHECK(value.dim() == 0, "masked_fill_ only supports a 0-dimensional value tensor, but got tensor " "with ", value.dim(), " dimension(s)."); + // We hit this function if either of the input tensor lives on CUDA. + // It is ok, if `value` is `CPU` tensor but we should not allow `self` or + // `mask` to be CPU tensor. 
Check for `self` and `mask` being on same device + // exists in `masked_fill__cuda` (Scalar version). + TORCH_CHECK(!self.device().is_cpu(), "masked_fill_: Expected inputs to be on same device") return masked_fill__cuda(self, mask, value.item()); } diff --git a/aten/src/ATen/native/cuda/JitLoops.cuh b/aten/src/ATen/native/cuda/JitLoops.cuh index bb37a6acc2e14..6f350c550ce93 100644 --- a/aten/src/ATen/native/cuda/JitLoops.cuh +++ b/aten/src/ATen/native/cuda/JitLoops.cuh @@ -12,11 +12,7 @@ #include -#if !AT_ROCM_ENABLED() #include -#else -#error Jiterator not supported on ROCm -#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/LinearAlgebraStubs.cpp b/aten/src/ATen/native/cuda/LinearAlgebraStubs.cpp index 913e30b77c0ff..cb6cacb3630fb 100644 --- a/aten/src/ATen/native/cuda/LinearAlgebraStubs.cpp +++ b/aten/src/ATen/native/cuda/LinearAlgebraStubs.cpp @@ -35,8 +35,7 @@ namespace native { namespace { cuda::detail::LinalgDispatch disp = {_symeig_helper_cuda, _cholesky_solve_helper_cuda, - legacy_lstsq_cuda, - _linalg_inv_out_helper_cuda}; + legacy_lstsq_cuda}; at::DynamicLibrary& getTorchLinalgLibrary() { static at::DynamicLibrary lib("libtorch_cuda_linalg.so", nullptr, true); @@ -177,12 +176,6 @@ void registerLinalgDispatch(const LinalgDispatch& disp_) { } }} //namespace cuda::detail -Tensor& _linalg_inv_out_helper_cuda(Tensor &result, Tensor& infos_lu, Tensor& infos_getri) { - getTorchLinalgLibrary(); - TORCH_CHECK(disp.inv_out_helper != _linalg_inv_out_helper_cuda, "Can't find _linalg_inv_out_helper_cuda"); - return disp.inv_out_helper(result, infos_lu, infos_getri); -} - std::tuple legacy_lstsq_cuda(const Tensor &B, const Tensor &A) { getTorchLinalgLibrary(); TORCH_CHECK(disp.legacy_lstsq != legacy_lstsq_cuda, "Can't find legacy_lstsq_cuda"); diff --git a/aten/src/ATen/native/cuda/LogcumsumexpKernel.cu b/aten/src/ATen/native/cuda/LogcumsumexpKernel.cu new file mode 100644 index 0000000000000..28b3236caa2de --- /dev/null +++ b/aten/src/ATen/native/cuda/LogcumsumexpKernel.cu @@ -0,0 +1,37 @@ +#define TORCH_ASSERT_NO_OPERATORS +#include +#include +#include + +#include +#include + +#include +#include + +namespace at { namespace native { + +void launch_logcumsumexp_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { + AT_DISPATCH_FLOATING_TYPES_AND2( + ScalarType::Half, ScalarType::BFloat16, + self.scalar_type(), "logcumsumexp_cuda", + [&]() { + using opmath_t = at::opmath_type; + scalar_t init = -std::numeric_limits::infinity(); + auto log_add_exp = [] C10_HOST_DEVICE (const scalar_t x_, const scalar_t y_) -> scalar_t { + const opmath_t x{x_}, y{y_}; + auto min = at::_isnan(y) ? y : std::min(x, y); //std::min returns first arg if one of the args is nan + auto max = at::_isnan(y) ? y : std::max(x, y); //std::max returns first arg if one of the args is nan + if (min != max || ::isfinite(min)) { + // nan will be propagated here + return ::log1p(std::exp(min - max)) + max; + } else { + // special case to correctly handle infinite inputs + return x; + } + }; + scan_dim(self, result, dim, init, log_add_exp); + }); +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/Loss.cu b/aten/src/ATen/native/cuda/Loss.cu index 2b5d17b9547ed..fcb3229198ab7 100644 --- a/aten/src/ATen/native/cuda/Loss.cu +++ b/aten/src/ATen/native/cuda/Loss.cu @@ -211,6 +211,7 @@ __global__ void nll_loss_forward_reduce_cuda_kernel_1d( // If the only element was omited, we get 0. 
See the discussion in // https://github.com/pytorch/pytorch/pull/64572#issuecomment-926504162 *output = scalar_t{0}; + *total_weight = scalar_t{0}; } } @@ -280,6 +281,7 @@ void nll_loss_forward_out_cuda_template( if (reduction == Reduction::None && n_dims == 2) { at::native::resize_output(output, {batch_size}); + total_weight.zero_(); if (batch_size == 0) { // This guards from unnecessary operations and launching CUDA kernel with // 0 blocks. diff --git a/aten/src/ATen/native/cuda/NLLLoss2d.cu b/aten/src/ATen/native/cuda/NLLLoss2d.cu index 2246c836f3dca..a2027587d1c5e 100644 --- a/aten/src/ATen/native/cuda/NLLLoss2d.cu +++ b/aten/src/ATen/native/cuda/NLLLoss2d.cu @@ -268,9 +268,9 @@ void nll_loss2d_forward_out_cuda_template( 0, at::cuda::getCurrentCUDAStream()>>>( count, - input.packed_accessor(), - target.packed_accessor(), - output.packed_accessor(), + input.packed_accessor64(), + target.packed_accessor64(), + output.packed_accessor64(), optional_data(weight_), ignore_index); C10_CUDA_KERNEL_LAUNCH_CHECK(); @@ -403,9 +403,9 @@ void nll_loss2d_backward_out_cuda_template( 0, at::cuda::getCurrentCUDAStream()>>>( count, - target.packed_accessor(), - grad_output.packed_accessor(), - grad_input.packed_accessor(), + target.packed_accessor64(), + grad_output.packed_accessor64(), + grad_input.packed_accessor64(), optional_data(weight_), ignore_index); C10_CUDA_KERNEL_LAUNCH_CHECK(); diff --git a/aten/src/ATen/native/cuda/Normalization.cuh b/aten/src/ATen/native/cuda/Normalization.cuh index a9b11e76db680..cc79284fea4db 100644 --- a/aten/src/ATen/native/cuda/Normalization.cuh +++ b/aten/src/ATen/native/cuda/Normalization.cuh @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -60,26 +61,10 @@ struct Float2 { v2 += a.v2; return *this; } -}; - -template -struct SumOp { - __device__ SumOp(const PTA& t) : tensor(t) {} - __device__ __forceinline__ accscalar_t operator()(int batch, int plane, int n) { - return static_cast(tensor[batch][plane][n]); - } - const PTA& tensor; -}; - -template -struct VarOp { - __device__ VarOp(accscalar_t m, const PTA& t) : mean(m), tensor(t) {} - __device__ __forceinline__ accscalar_t operator()(int batch, int plane, int n) { - accscalar_t val = tensor[batch][plane][n]; - return (val - mean) * (val - mean); + __device__ friend Float2 operator+(Float2 a, const Float2& b) { + a += b; + return a; } - const accscalar_t mean; - const PTA& tensor; }; template @@ -96,21 +81,25 @@ struct GradOp { const PTA& grad_output; }; -// Sum across all threads within a warp -template -static __device__ __forceinline__ T warpSum(T val) { - for (int i = 0; i < getMSB(C10_WARP_SIZE); ++i) { - val += WARP_SHFL_XOR(val, 1 << i, C10_WARP_SIZE); - } - return val; -} +template +struct SumReduceOp { + __device__ __forceinline__ acc_t combine(acc_t a, acc_t b) const { return a + b; } + + __device__ __forceinline__ acc_t warp_shfl_down(acc_t data, int offset) const { + return WARP_SHFL_DOWN(data, offset); + } +}; template -static __device__ __forceinline__ Float2 warpSum(Float2 value) { - value.v1 = warpSum(value.v1); - value.v2 = warpSum(value.v2); - return value; -} +struct SumReduceOp> { + using acc_t = Float2; + + __device__ __forceinline__ acc_t combine(acc_t a, acc_t b) const { return a + b; } + + __device__ __forceinline__ acc_t warp_shfl_down(acc_t data, int offset) const { + return {WARP_SHFL_DOWN(data.v1, offset), WARP_SHFL_DOWN(data.v2, offset)}; + } +}; // Sum across (batch, x/y/z) applying Op() pointwise // this works by first having each thread sum it's part 
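The functors above (SumReduceOp here, DistReduceOp in DistanceKernel.cu) only have to supply `combine` and `warp_shfl_down`; `cuda_utils::BlockReduce` then provides the two-stage warp/shared-memory reduction that the deleted `warpSum`/`reduce_agg` code used to hand-roll. Below is a minimal, self-contained sketch of the same pattern, assuming a 1-D block whose size is a multiple of the warp size; the functor, kernel, and `BlockReduce` stand-in are illustrative simplifications, with plain `__shfl_down_sync` in place of the `WARP_SHFL_DOWN` macro.

```
// nvcc block_reduce_sketch.cu && ./a.out
#include <cuda_runtime.h>
#include <cfloat>
#include <cstdio>

constexpr int kWarpSize = 32;
constexpr int kThreads = 256;

// Same interface the patch's reduce-op functors expose: combine() + warp_shfl_down().
struct MaxOp {
  __device__ float combine(float a, float b) const { return fmaxf(a, b); }
  __device__ float warp_shfl_down(float v, int offset) const {
    return __shfl_down_sync(0xffffffff, v, offset);
  }
};

// Simplified stand-in for cuda_utils::BlockReduce (1-D blocks, blockDim.x a
// multiple of the warp size). The result is only valid in thread 0.
template <typename T, class ReduceOp>
__device__ T BlockReduce(T val, const ReduceOp& op, T identity, T* shared) {
  const int lid = threadIdx.x % kWarpSize;
  const int wid = threadIdx.x / kWarpSize;
  for (int off = kWarpSize / 2; off > 0; off /= 2)
    val = op.combine(val, op.warp_shfl_down(val, off));
  if (lid == 0) shared[wid] = val;  // one partial result per warp
  __syncthreads();
  val = (threadIdx.x < blockDim.x / kWarpSize) ? shared[lid] : identity;
  if (wid == 0)  // first warp reduces the per-warp partials
    for (int off = kWarpSize / 2; off > 0; off /= 2)
      val = op.combine(val, op.warp_shfl_down(val, off));
  return val;
}

__global__ void block_max(const float* in, int n, float* out) {
  __shared__ float smem[kThreads / kWarpSize];
  float v = -FLT_MAX;
  for (int i = threadIdx.x; i < n; i += blockDim.x) v = fmaxf(v, in[i]);
  v = BlockReduce(v, MaxOp{}, -FLT_MAX, smem);
  if (threadIdx.x == 0) *out = v;
}

int main() {
  const int n = 1000;
  float h[n], *d_in, *d_out, result;
  for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i);
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_out, sizeof(float));
  cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);
  block_max<<<1, kThreads>>>(d_in, n, d_out);
  cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
  printf("max = %f\n", result);  // expect 999
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```

The real `cuda_utils::BlockReduce` additionally takes a block-indexing policy (`Block1D`/`Block2D`, added later in this patch) so that 2-D thread blocks such as the batch-norm reduction can reuse the same helper.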
@@ -130,37 +119,13 @@ __device__ scalar_t reduce(Op op, PTA tensor, int plane) { sum += op(batch, plane, x); } } - - // first warpSum to get one value per thread to - // one value per warp - sum = warpSum(sum); - - // this writes each warps item into shared memory - // there are at most C10_WARP_SIZE items left because - // there are at most C10_WARP_SIZE**2 threads at the beginning __shared__ scalar_t shared[C10_WARP_SIZE]; - __syncthreads(); - int tid = threadIdx.x + threadIdx.y * blockDim.x; - if (tid % C10_WARP_SIZE == 0) { - shared[tid / C10_WARP_SIZE] = sum; - } - if (tid >= blockDim.x * blockDim.y / C10_WARP_SIZE && tid < C10_WARP_SIZE) { - // zero out the other entries in shared - shared[tid] = (scalar_t)0; - } - __syncthreads(); - // now have a second warpSum to reduce the intermediate values - // from shared memory to a single number. The very first - // thread writes it to shared memory. - - if (tid / C10_WARP_SIZE == 0) { - sum = warpSum(shared[tid]); - if (tid == 0) { + SumReduceOp reduce_op; + sum = cuda_utils::BlockReduce, cuda_utils::Block2D>(sum, reduce_op, 0, shared); + if (threadIdx.x == 0 && threadIdx.y == 0) { shared[0] = sum; - } } __syncthreads(); - // Everyone picks it up, should be broadcast into the whole grad_input return shared[0]; } diff --git a/aten/src/ATen/native/cuda/PersistentSoftmax.cuh b/aten/src/ATen/native/cuda/PersistentSoftmax.cuh index 9958d4c9b8144..5d3bea36e37a3 100644 --- a/aten/src/ATen/native/cuda/PersistentSoftmax.cuh +++ b/aten/src/ATen/native/cuda/PersistentSoftmax.cuh @@ -90,7 +90,7 @@ __global__ void softmax_warp_forward(output_t *dst, const input_t *src, int batc dst += idx_offset; if (is_transformer_mask) { - mask += (idx_offset / head_chunk_size) * stride + local_idx; + mask += ((first_batch * stride) / head_chunk_size) * stride + local_idx; } else { mask += idx_offset; } @@ -117,13 +117,14 @@ __global__ void softmax_warp_forward(output_t *dst, const input_t *src, int batc acc_t max_value[WARP_BATCH]; #pragma unroll for (int i = 0; i < WARP_BATCH; ++i) { + int batch_element_count = (i >= local_batches) ? 0 : element_count; bool is_meaningful_max = false; max_value[i] = elements[i][0]; #pragma unroll for (int it = 0; it < WARP_ITERATIONS; ++it) { if (is_masked) { int idx = it*WARP_SIZE; - if ((idx + local_idx) < element_count) { + if ((idx + local_idx) < batch_element_count) { if (!is_transformer_mask) { idx += i*element_count; } @@ -147,6 +148,7 @@ __global__ void softmax_warp_forward(output_t *dst, const input_t *src, int batc acc_t sum[WARP_BATCH] { 0.0f }; #pragma unroll for (int i = 0; i < WARP_BATCH; ++i) { + int batch_element_count = (i >= local_batches) ? 
0 : element_count; #pragma unroll for (int it = 0; it < WARP_ITERATIONS; ++it) { if (!is_masked) { @@ -158,7 +160,7 @@ __global__ void softmax_warp_forward(output_t *dst, const input_t *src, int batc } } else { int idx = it*WARP_SIZE; - bool valid = (idx + local_idx) < element_count; + bool valid = (idx + local_idx) < batch_element_count; if (!is_transformer_mask) { idx += i*element_count; } diff --git a/aten/src/ATen/native/cuda/ScanKernels.cpp b/aten/src/ATen/native/cuda/ScanKernels.cpp index 206543384a996..69f86c006950c 100644 --- a/aten/src/ATen/native/cuda/ScanKernels.cpp +++ b/aten/src/ATen/native/cuda/ScanKernels.cpp @@ -89,6 +89,11 @@ Tensor _logcumsumexp_cuda(const Tensor& self, int64_t dim) { } void cumsum_cuda_kernel(const Tensor& result, const Tensor& self, int64_t dim) { + if (self.is_floating_point() || self.is_complex()) { + // See Note [Writing Nondeterministic Operations] + // Issue reporting nondeterministic behavior: https://github.com/pytorch/pytorch/issues/75240 + globalContext().alertNotDeterministic("cumsum_cuda_kernel"); + } auto result_ = contiguous_out_arg(result); launch_cumsum_cuda_kernel(*result_, self, dim); if (!result.is_same(*result_)) { diff --git a/aten/src/ATen/native/cuda/ScanKernels.cu b/aten/src/ATen/native/cuda/ScanUtils.cuh similarity index 84% rename from aten/src/ATen/native/cuda/ScanKernels.cu rename to aten/src/ATen/native/cuda/ScanUtils.cuh index 44982208c086a..ba27a245172b5 100644 --- a/aten/src/ATen/native/cuda/ScanKernels.cu +++ b/aten/src/ATen/native/cuda/ScanUtils.cuh @@ -1,18 +1,15 @@ -#define TORCH_ASSERT_NO_OPERATORS -#include -#include -#include -#include -#include +#pragma once #include -#include -#include - +#include #include +#include -#include +#include +#include +#include -namespace at { namespace native { +namespace at { +namespace native { template constexpr inline integer ceil_div(integer n, integer m) { @@ -158,7 +155,7 @@ __global__ void tensor_kernel_scan_outer_dim_with_indices(scalar_t *self_, scala } } -void check_fits_in_unsigned(int64_t val, const char* name) { +inline void check_fits_in_unsigned(int64_t val, const char* name) { constexpr auto umax = std::numeric_limits::max(); TORCH_CHECK( val >= 0 && val <= umax, name, " must fit in a 32-bit uint32_t value"); @@ -224,22 +221,6 @@ void scan_dim_with_indices(const TensorBase& self, const TensorBase& values, con } } -void launch_cummax_cuda_kernel(const TensorBase& self, const TensorBase& values, const TensorBase& indices, int64_t dim) { - AT_DISPATCH_ALL_TYPES_AND3(at::ScalarType::Bool, at::ScalarType::Half, at::ScalarType::BFloat16, - self.scalar_type(), "cummax_cuda", [&]() { - scalar_t init = self.is_floating_point() ? (-1*std::numeric_limits::infinity()) : std::numeric_limits::lowest(); - scan_dim_with_indices(self, values, indices, dim, init, std::greater_equal()); - }); -} - -void launch_cummin_cuda_kernel(const TensorBase& self, const TensorBase& values, const TensorBase& indices, int64_t dim) { - AT_DISPATCH_ALL_TYPES_AND3(at::ScalarType::Bool, at::ScalarType::Half, at::ScalarType::BFloat16, - self.scalar_type(), "cummin_cuda", [&]() { - scalar_t init = self.is_floating_point() ? 
std::numeric_limits::infinity() : std::numeric_limits::max(); - scan_dim_with_indices(self, values, indices, dim, init, std::less_equal()); - }); -} - // TODO: The implementation of `tensor_kernel_scan_outer_dim` and // `tensor_kernel_scan_innermost_dim` is similar to // `tensor_kernel_scan_outer_dim_with_indices` @@ -468,54 +449,4 @@ void scan_dim(const TensorBase& self, const TensorBase& result, } } -void launch_logcumsumexp_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { - AT_DISPATCH_FLOATING_TYPES_AND2( - ScalarType::Half, ScalarType::BFloat16, - self.scalar_type(), "logcumsumexp_cuda", - [&]() { - using accscalar_t = acc_type; - scalar_t init = -std::numeric_limits::infinity(); - auto log_add_exp = [] C10_HOST_DEVICE (const scalar_t x, const scalar_t y) -> scalar_t { - scalar_t min = at::_isnan(y) ? y : std::min(x,y); //std::min returns first arg if one of the args is nan - scalar_t max = at::_isnan(y) ? y : std::max(x,y); //std::max returns first arg if one of the args is nan - if (min != max || ::isfinite(static_cast(min))) { - // nan will be propagated here - return ::log1p(std::exp(min - max)) + max; - } else { - // special case to correctly handle infinite inputs - return x; - } - }; - scan_dim(self, result, dim, init, log_add_exp); - }); -} - -void launch_cumsum_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( - ScalarType::Half, ScalarType::BFloat16, - self.scalar_type(), "cumsum_cuda", - [&]() { - scalar_t init = 0; - scan_dim( - self, - result, - dim, - init, - std::plus()); - }); -} - -void launch_cumprod_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( - ScalarType::Half, ScalarType::BFloat16, self.scalar_type(), "cumprod_cuda", [&]() { - scalar_t init = 1; - scan_dim( - self, - result, - dim, - init, - std::multiplies()); - }); -} - -}} // namespace at::native +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/SoftMax.cu b/aten/src/ATen/native/cuda/SoftMax.cu index feeb08ae34b34..c53276e619be2 100644 --- a/aten/src/ATen/native/cuda/SoftMax.cu +++ b/aten/src/ATen/native/cuda/SoftMax.cu @@ -957,18 +957,24 @@ TORCH_IMPL_FUNC(softmax_backward_cuda_out) host_softmax_backward(tmp, output, dim, half_to_float, grad_input); } -Tensor masked_softmax_cuda(const Tensor& input_, const Tensor& mask_, const c10::optional dim_) { +Tensor masked_softmax_cuda(const Tensor& input_, const Tensor& mask_, const c10::optional dim_, const c10::optional mask_type_) { Tensor output = at::empty_like(input_, input_.options()); TORCH_CHECK(mask_.scalar_type() == ScalarType::Bool, "Mask should be a boolean tensor"); + TORCH_CHECK(mask_type_.has_value(), "Mask Type should be defined"); + int64_t mask_type = mask_type_.value(); + TORCH_CHECK((mask_type == 0) || (mask_type == 1), "Mask Type should be 0 (src_mask) or 1 (src_key_padding_mask)"); + // If input is [B, H, T, T] and mask is [B, T] // we have special fast kernel - bool is_BxT_mask = (input_.dim() == 4 && mask_.dim() == 2 && input_.size(0) == mask_.size(0) && input_.size(2) == mask_.size(1) && input_.size(3) == mask_.size(1)); + // mask_type == 1 => mask_ is a src_key_padding_mask + bool is_BxT_mask = (mask_type == 1) && (input_.dim() == 4 && mask_.dim() == 2 && input_.size(0) == mask_.size(0) && input_.size(2) == mask_.size(1) && input_.size(3) == mask_.size(1)); // If input is [B, H, T, T] and mask is [T, T] // expand mask to [B, H, T, T] and treat it like 
regular mask // TODO We should have special fast kernel for TxT mask as well - bool is_TxT_mask = input_.dim() == 4 && mask_.dim() == 2 && input_.size(3) == mask_.size(1) && input_.size(2) == mask_.size(0) && mask_.size(0) == mask_.size(1); + // mask_type == 0 => mask_ is a src_mask + bool is_TxT_mask = (mask_type == 0) && input_.dim() == 4 && mask_.dim() == 2 && input_.size(3) == mask_.size(1) && input_.size(2) == mask_.size(0) && mask_.size(0) == mask_.size(1); TORCH_CHECK(mask_.sizes() == input_.sizes() || is_BxT_mask || is_TxT_mask, "Mask shape should match input. mask: ", mask_.sizes(), " input: ", input_.sizes()); auto input = input_.dim() == 0 ? input_.view(1) : input_; diff --git a/aten/src/ATen/native/cuda/TensorFactories.cu b/aten/src/ATen/native/cuda/TensorFactories.cu index 6e05908b2ccea..03711b194a983 100644 --- a/aten/src/ATen/native/cuda/TensorFactories.cu +++ b/aten/src/ATen/native/cuda/TensorFactories.cu @@ -294,10 +294,10 @@ Tensor tril_indices_cuda( cuda::getApplyGrid(tril_size, dim_grid, tensor.get_device()), "unable to get dim grid"); - AT_DISPATCH_ALL_TYPES_AND(at::ScalarType::Half, tensor.scalar_type(), "tril_indices_cuda", [&] { + AT_DISPATCH_INDEX_TYPES(tensor.scalar_type(), "tril_indices_cuda", [&] { tril_indices_kernel<<< dim_grid, dim_block, 0, at::cuda::getCurrentCUDAStream()>>>( - tensor.data_ptr(), + tensor.data_ptr(), trapezoid_row_offset, m_first_row, col, @@ -372,10 +372,10 @@ Tensor triu_indices_cuda( cuda::getApplyGrid(triu_size, dim_grid, tensor.get_device()), "unable to get dim grid"); - AT_DISPATCH_ALL_TYPES_AND(at::ScalarType::Half, tensor.scalar_type(), "triu_indices_cuda", [&] { + AT_DISPATCH_INDEX_TYPES(tensor.scalar_type(), "triu_indices_cuda", [&] { triu_indices_kernel<<< dim_grid, dim_block, 0, at::cuda::getCurrentCUDAStream()>>>( - tensor.data_ptr(), + tensor.data_ptr(), std::max(0, offset), m_first_row, col, diff --git a/aten/src/ATen/native/cuda/TensorTopK.cu b/aten/src/ATen/native/cuda/TensorTopK.cu index 1caf3ec576086..631d887c1a01f 100644 --- a/aten/src/ATen/native/cuda/TensorTopK.cu +++ b/aten/src/ATen/native/cuda/TensorTopK.cu @@ -405,13 +405,23 @@ __global__ void computeBlockwiseWithinKCounts( int current_bit, bool largest, // outputs: - uint32_t* withinKCounts // size: num_slices * blocks_per_slice == num_blocks + uint32_t* withinKCounts, // size: num_slices * blocks_per_slice == num_blocks + uint32_t num_blocks ) { // This kernel should be launched with the same number of blocks as the `radixFindKthValues` kernel. int tidx = threadIdx.x; uint32_t block_idx = getLinearBlockId(); uint32_t slice_idx = block_idx / blocks_per_slice; + // The grid is computed from `getGridFromTiles`, when there are lots of + // elements, we will use both blockIdx.x and blockIdx.y, and maybe blockIdx.z + // when this is the case, the number of blocks that we are launching can be + // more than the number of blocks we need. So we need to check the range of + // `block_idx`. 
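The early return added just below is the whole fix: when a large linear block count is folded into a 2-D/3-D grid (capped at 65535 per axis), the grid is rounded up, so the flattened block index can exceed the number of blocks that actually have work. A standalone sketch of the pattern follows, with illustrative names (`linear_block_id`, `tiled_kernel`, `grid_from_tiles`) rather than the real `getGridFromTiles` helper.

```
#include <cuda_runtime.h>
#include <cstddef>

// Flatten (blockIdx.x, y, z) back into the linear block index the work was
// partitioned by.
__device__ unsigned int linear_block_id() {
  return blockIdx.x + blockIdx.y * gridDim.x +
         blockIdx.z * gridDim.x * gridDim.y;
}

__global__ void tiled_kernel(float* data, unsigned int num_blocks) {
  const unsigned int block_idx = linear_block_id();
  // The grid was rounded up to respect the per-axis limit, so trailing blocks
  // may have no work assigned; bail out instead of indexing past the buffer.
  if (block_idx >= num_blocks) {
    return;
  }
  const size_t i = static_cast<size_t>(block_idx) * blockDim.x + threadIdx.x;
  data[i] += 1.0f;  // illustrative work
}

// Host side: split num_blocks across grid axes, capping each axis at 65535.
// The product of the axes can exceed num_blocks, hence the in-kernel guard.
inline dim3 grid_from_tiles(unsigned long long num_blocks) {
  const unsigned long long cap = 65535;
  unsigned long long x = num_blocks, y = 1, z = 1;
  if (x > cap) { y = (x + cap - 1) / cap; x = cap; }
  if (y > cap) { z = (y + cap - 1) / cap; y = cap; }
  return dim3(static_cast<unsigned>(x), static_cast<unsigned>(y),
              static_cast<unsigned>(z));
}
```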
+ if (block_idx >= num_blocks) { + return; + } + Bitwise desired = doLdg(desires + slice_idx); Bitwise desired_digit = at::cuda::Bitfield::getBitfield(desired, current_bit, RADIX_BITS); @@ -702,7 +712,7 @@ void launch( C10_CUDA_KERNEL_LAUNCH_CHECK(); #if CUB_SUPPORTS_SCAN_BY_KEY() computeBlockwiseWithinKCounts<<>>( - desired, counts, blocks_per_slice, current_bit, largest, withinKCounts); + desired, counts, blocks_per_slice, current_bit, largest, withinKCounts, num_blocks); C10_CUDA_KERNEL_LAUNCH_CHECK(); #endif desiredMask = at::cuda::Bitfield::setBitfield(desiredMask, RADIX_MASK, current_bit, RADIX_BITS); diff --git a/aten/src/ATen/native/cuda/UnaryComplexKernels.cu b/aten/src/ATen/native/cuda/UnaryComplexKernels.cu index 0589c3ba4f0dd..a04194b1117e5 100644 --- a/aten/src/ATen/native/cuda/UnaryComplexKernels.cu +++ b/aten/src/ATen/native/cuda/UnaryComplexKernels.cu @@ -1,6 +1,7 @@ #define TORCH_ASSERT_NO_OPERATORS #include #include +#include #include #include #include @@ -58,22 +59,10 @@ void angle_kernel_cuda(TensorIteratorBase& iter) { } } -// We manually overload conj because std::conj does not work types other than c10::complex. -template -__host__ __device__ static inline scalar_t conj_wrapper(scalar_t v) { - return v; -} - -template -__host__ __device__ static inline c10::complex conj_wrapper(c10::complex v) { - return std::conj(v); -} - // NB: Ignores the negative bit on tensors const char conj_name[] = "conj_kernel"; void conj_kernel_cuda(TensorIteratorBase& iter) { - auto common_dtype = iter.common_dtype(); - if (common_dtype == kComplexHalf) { + auto conj_chalf = [&] { using scalar_t = c10::complex; #if AT_USE_JITERATOR() static const auto conj_string = jiterator_stringify( @@ -85,17 +74,23 @@ void conj_kernel_cuda(TensorIteratorBase& iter) { jitted_gpu_kernel(iter, conj_string); #else gpu_kernel(iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t { - return conj_wrapper(a); + return std::conj(a); }); #endif - } else { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( - kBool, kBFloat16, kHalf, iter.common_dtype(), "conj_cuda", [&]() { - gpu_kernel(iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t { - return conj_wrapper(a); - }); - }); - } + }; + + AT_DISPATCH_SWITCH(iter.common_dtype(), "conj_cuda", + AT_DISPATCH_CASE_ALL_TYPES_AND3(kBool, kBFloat16, kHalf, [&] { + // Conj is a no-op for non-complex types + direct_copy_kernel_cuda(iter); + }) + AT_DISPATCH_CASE_COMPLEX_TYPES([&] { + gpu_kernel(iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t { + return std::conj(a); + }); + }) + AT_DISPATCH_CASE(kComplexHalf, conj_chalf) + ); } REGISTER_DISPATCH(angle_stub, &angle_kernel_cuda); diff --git a/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu b/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu index 0cb0d9f238cf5..2481fd6028960 100644 --- a/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu +++ b/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu @@ -151,7 +151,9 @@ void sigmoid_kernel_cuda(TensorIteratorBase& iter) { } else { AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, common_dtype, "sigmoid_cuda", [&]() { gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { - return scalar_t{1} / (scalar_t{1} + std::exp(-a)); + using opmath_t = at::opmath_type; + const auto one = opmath_t{1}; + return static_cast(one/(one + std::exp(-opmath_t{a}))); }); }); } @@ -179,8 +181,9 @@ void sinc_kernel_cuda(TensorIteratorBase& iter) { return scalar_t(1); } else { // NVCC says constexpr var is not accessible from device - scalar_t product = c10::detail::pi() * a; - return 
std::sin(product) / product; + using opmath_t = at::opmath_type; + opmath_t product = c10::detail::pi() * opmath_t{a}; + return static_cast(std::sin(product) / product); } }); }); diff --git a/aten/src/ATen/native/cuda/block_reduce.cuh b/aten/src/ATen/native/cuda/block_reduce.cuh index e01cd0b060f53..fa75c71f8acaf 100644 --- a/aten/src/ATen/native/cuda/block_reduce.cuh +++ b/aten/src/ATen/native/cuda/block_reduce.cuh @@ -29,24 +29,43 @@ __inline__ __device__ T WarpReduceSum(T val) { return val; } +struct Block1D { + static __forceinline__ __device__ int Tid() { return threadIdx.x; } + + static __forceinline__ __device__ int Warps() { + return blockDim.x / C10_WARP_SIZE; + } +}; + +struct Block2D { + static __forceinline__ __device__ int Tid() { + return threadIdx.x + threadIdx.y * blockDim.x; + } + + static __forceinline__ __device__ int Warps() { + return blockDim.x * blockDim.y / C10_WARP_SIZE; + } +}; + // Sums `val` across all threads in a block. // +// Warning: the return value is only valid for thread 0. // Assumptions: -// - Thread blocks are an 1D set of threads (indexed with `threadIdx.x` only) // - The size of each block should be a multiple of `C10_WARP_SIZE` // - `shared` should be a pointer to shared memory with size of, at least, // `sizeof(T) * number_of_warps` -template +template __inline__ __device__ T BlockReduceSum(T val, T* shared) { - const int lid = threadIdx.x % C10_WARP_SIZE; - const int wid = threadIdx.x / C10_WARP_SIZE; + const int tid = B::Tid(); + const int lid = tid % C10_WARP_SIZE; + const int wid = tid / C10_WARP_SIZE; val = WarpReduceSum(val); - __syncthreads(); + __syncthreads(); // prevent races when BlockReduces are called in a row. if (lid == 0) { shared[wid] = val; } __syncthreads(); - val = (threadIdx.x < blockDim.x / C10_WARP_SIZE) ? shared[lid] : T(0); + val = (tid < B::Warps()) ? shared[lid] : T(0); if (wid == 0) { val = WarpReduceSum(val); } @@ -62,19 +81,19 @@ __inline__ __device__ T WarpReduce(T val, const ReduceOp& op) { return val; } -template +template __inline__ __device__ T BlockReduce(T val, const ReduceOp& op, const T& identity_element, T* shared) { - const int lid = threadIdx.x % C10_WARP_SIZE; - const int wid = threadIdx.x / C10_WARP_SIZE; + const int tid = B::Tid(); + const int lid = tid % C10_WARP_SIZE; + const int wid = tid / C10_WARP_SIZE; val = WarpReduce(val, op); - __syncthreads(); + __syncthreads(); // prevent races when BlockReduces are called in a row. if (lid == 0) { shared[wid] = val; } __syncthreads(); - val = (threadIdx.x < blockDim.x / C10_WARP_SIZE) ? shared[lid] - : identity_element; + val = (tid < B::Warps()) ? shared[lid] : identity_element; if (wid == 0) { val = WarpReduce(val, op); } diff --git a/aten/src/ATen/native/cuda/jit_utils.cpp b/aten/src/ATen/native/cuda/jit_utils.cpp index 673ea9f476e46..f86b624b84f7c 100644 --- a/aten/src/ATen/native/cuda/jit_utils.cpp +++ b/aten/src/ATen/native/cuda/jit_utils.cpp @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -40,7 +41,148 @@ namespace at { namespace cuda { namespace jit { +// hiprtc already includes some traits, so this removes duplicate definitions of +// integral_constant, is_same, is_integral, enable_if, is_floating_point, is_arithmetic. +// Copied from aten/src/ATen/cuda/llvm_basic.cpp, then modified as above. +// If not compiling for ROCm, return the original get_traits_string(). 
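A few hunks up, the sigmoid and sinc kernels were changed to do their arithmetic in `at::opmath_type` (float when the storage type is half or bfloat16) and only cast back on the final store, which keeps `exp` and the division out of reduced precision. A standalone sketch of that pattern for `__half` using plain CUDA types; the kernel is illustrative, not the ATen implementation.

```
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// sigmoid(x) = 1 / (1 + exp(-x)), computed in float ("opmath") precision even
// though the tensors are stored as half.
__global__ void sigmoid_half(const __half* in, __half* out, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    const float x = __half2float(in[i]);       // widen on load
    const float y = 1.0f / (1.0f + expf(-x));  // math in float
    out[i] = __float2half(y);                  // narrow on store
  }
}
```

This mirrors how the patch widens with `opmath_t{a}` before calling `std::exp` and narrows the result with `static_cast<scalar_t>`.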
+std::string get_traits_string_but_hiprtc_safe() { +#ifdef USE_ROCM + return R"ESCAPE( +namespace std { + +template +_Tp&& __declval(int); +template +_Tp __declval(long); +template +decltype(__declval<_Tp>(0)) declval() noexcept; + +template struct remove_const {typedef _Tp type;}; +template struct remove_const {typedef _Tp type;}; +template using remove_const_t = typename remove_const<_Tp>::type; + +template struct remove_volatile {typedef _Tp type;}; +template struct remove_volatile {typedef _Tp type;}; +template using remove_volatile_t = typename remove_volatile<_Tp>::type; + +template struct remove_cv +{typedef typename remove_volatile::type>::type type;}; +template using remove_cv_t = typename remove_cv<_Tp>::type; + +template struct __libcpp_is_floating_point : public false_type {}; +template <> struct __libcpp_is_floating_point : public true_type {}; +template <> struct __libcpp_is_floating_point : public true_type {}; +template <> struct __libcpp_is_floating_point : public true_type {}; + +template +inline constexpr bool is_arithmetic_v = is_arithmetic<_Tp>::value; + +template +struct __numeric_type +{ + static void __test(...); + static float __test(float); + static double __test(char); + static double __test(int); + static double __test(unsigned); + static double __test(long); + static double __test(unsigned long); + static double __test(long long); + static double __test(unsigned long long); + static double __test(double); + static long double __test(long double); + + typedef decltype(__test(declval<_Tp>())) type; + static const bool value = !is_same::value; +}; + +template <> +struct __numeric_type +{ + static const bool value = true; +}; + +// __promote + +template ::value && + __numeric_type<_A2>::value && + __numeric_type<_A3>::value> +class __promote_imp +{ +public: + static const bool value = false; +}; + +template +class __promote_imp<_A1, _A2, _A3, true> +{ +private: + typedef typename __promote_imp<_A1>::type __type1; + typedef typename __promote_imp<_A2>::type __type2; + typedef typename __promote_imp<_A3>::type __type3; +public: + typedef decltype(__type1() + __type2() + __type3()) type; + static const bool value = true; +}; + +template +class __promote_imp<_A1, _A2, void, true> +{ +private: + typedef typename __promote_imp<_A1>::type __type1; + typedef typename __promote_imp<_A2>::type __type2; +public: + typedef decltype(__type1() + __type2()) type; + static const bool value = true; +}; + +template +class __promote_imp<_A1, void, void, true> +{ +public: + typedef typename __numeric_type<_A1>::type type; + static const bool value = true; +}; + +template +class __promote : public __promote_imp<_A1, _A2, _A3> {}; + +} // namespace std +)ESCAPE"; +#else + return get_traits_string(); +#endif +} + +#ifdef USE_ROCM +const std::string jit_preamble = R"ESCAPE( +#pragma clang force_cuda_host_device begin +)ESCAPE"; +const std::string jit_epilogue = R"ESCAPE( +#pragma clang force_cuda_host_device end +)ESCAPE"; +#else +const std::string jit_preamble; +const std::string jit_epilogue; +#endif + const std::string jit_common_types = R"ESCAPE( + #ifdef __HIPCC__ + #define ERROR_UNSUPPORTED_CAST ; + // corresponds to aten/src/ATen/native/cuda/thread_constants.h + #define CUDA_OR_ROCM_NUM_THREADS 256 + // corresponds to aten/src/ATen/cuda/detail/OffsetCalculator.cuh + #define MAX_DIMS 16 + #ifndef __forceinline__ + #define __forceinline__ inline __attribute__((always_inline)) + #endif + #else + //TODO use _assert_fail, because assert is disabled in non-debug builds + #define 
ERROR_UNSUPPORTED_CAST assert(false); + #define CUDA_OR_ROCM_NUM_THREADS 128 + #define MAX_DIMS 25 + #endif #define POS_INFINITY __int_as_float(0x7f800000) #define INFINITY POS_INFINITY #define NEG_INFINITY __int_as_float(0xff800000) @@ -54,11 +196,9 @@ const std::string jit_common_types = R"ESCAPE( static_assert(sizeof(int64_t) == 8, "expected size does not match"); static_assert(sizeof(uint32_t) == 4, "expected size does not match"); static_assert(sizeof(int8_t) == 1, "expected size does not match"); - constexpr int num_threads = 128; + constexpr int num_threads = CUDA_OR_ROCM_NUM_THREADS; constexpr int thread_work_size = 4; // TODO: make template substitution once we decide where those vars live constexpr int block_work_size = thread_work_size * num_threads; - //TODO use _assert_fail, because assert is disabled in non-debug builds - #define ERROR_UNSUPPORTED_CAST assert(false); ${traits_string} ${cmath_string} @@ -146,15 +286,22 @@ struct alignas(2) Half { Half() = default; inline __host__ __device__ Half(float value){ +#ifdef __HIPCC__ + x = __half_as_short(__float2half(value)); +#else asm("{ cvt.rn.f16.f32 %0, %1;}\n" : "=h"(x) : "f"(value)); +#endif } inline __host__ __device__ operator float() const{ +#ifdef __HIPCC__ + return __half2float(*reinterpret_cast(&x)); +#else float val; asm("{ cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(x)); // do we need const cast here? //asm("{ cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(__HALF_TO_CUS(x))); return val; +#endif } - }; } )ESCAPE"; @@ -201,9 +348,18 @@ struct alignas(2) BFloat16 { } inline __host__ __device__ operator float() const{ +#ifdef __HIPCC__ + union + { + uint32_t int32; + float fp32; + } u = {uint32_t(x) << 16}; + return u.fp32; +#else float val; asm("{ mov.b32 %0, {0,%1};}\n" : "=f"(val) : "h"(x)); //do we need const cast here? 
return val; +#endif } }; @@ -450,7 +606,7 @@ const std::string offset_calc_template = R"ESCAPE( } #pragma unroll - for (int dim = 0; dim < 25; ++dim) { + for (int dim = 0; dim < MAX_DIMS; ++dim) { if (dim == dims) { break; } @@ -469,9 +625,9 @@ const std::string offset_calc_template = R"ESCAPE( } int dims; - IntDivider sizes_[25]; + IntDivider sizes_[MAX_DIMS]; // NOTE: this approach will not support nInputs == 0 - ${index_type} strides_[25][NARGS]; + ${index_type} strides_[MAX_DIMS][NARGS]; }; @@ -501,7 +657,7 @@ const std::string jit_code_template = R"ESCAPE( int idx = blockIdx.x; int remaining = numel - block_work_size * idx; - auto thread_idx = threadIdx.x; + int thread_idx = threadIdx.x; #pragma unroll for (int j = 0; j < thread_work_size; j++){ @@ -592,7 +748,7 @@ const std::string jit_vectorized_code_template = R"ESCAPE( constexpr int vec_size = ${vec_size}; using scalar_t = ${scalar_type}; int remaining = N - block_work_size * blockIdx.x; - auto thread_idx = threadIdx.x; + int thread_idx = threadIdx.x; int idx = blockIdx.x; ${declare_load_arrays} ${declare_store_arrays} @@ -651,6 +807,49 @@ const std::string jit_vectorized_code_template = R"ESCAPE( } )ESCAPE"; +static void replace_all(std::string& s, const std::string& to_replace, const std::string& replace_with) { + std::ostringstream oss; + std::size_t pos = 0; + std::size_t prev_pos = pos; + + while (true) { + prev_pos = pos; + pos = s.find(to_replace, pos); + if (pos == std::string::npos) + break; + oss << s.substr(prev_pos, pos - prev_pos); + oss << replace_with; + pos += to_replace.size(); + } + + oss << s.substr(prev_pos); + s = oss.str(); +} + +// hipify replaces certain device math functions, e.g., std::max -> ::max +// See torch/utils/hipify/cuda_to_hip_mappings.py. +// Replace them back. Search for " ::" to avoid duplicate replacements. +static std::string unhipify_math_functions(const std::string &original) { + static std::vector> mappings = { + {" std::max", " ::max"}, + {" std::min", " ::min"}, + {" std::ceil", " ::ceil"}, + {" std::floor", " ::floor"}, + {" std::exp", " ::exp"}, + {" std::log", " ::log"}, + {" std::pow", " ::pow"}, + {" std::fabs", " ::fabs"}, + {" std::fmod", " ::fmod"}, + {" std::remainder", " ::remainder"}, + {" std::frexp", " ::frexp"} + }; + std::string ret = original; + for (const auto& mapping : mappings) { + replace_all(ret, mapping.second, mapping.first); + } + return ret; +} + // The following is copied from fused_kernel.cpp // TODO: refactor codegenOutputQuery into its own file // that can be included by both files @@ -668,7 +867,12 @@ void codegenOutputQuery( int& nvrtc_major, int& nvrtc_minor, bool& compile_to_sass) { - +#ifdef USE_ROCM + AT_CUDA_NVRTC_CHECK(nvrtc().nvrtcVersion(&nvrtc_major, &nvrtc_minor)); + cuda_major = prop->major; + cuda_minor = prop->minor; + compile_to_sass = false; +#else AT_CUDA_NVRTC_CHECK(nvrtc().nvrtcVersion(&nvrtc_major, &nvrtc_minor)); TORCH_CHECK( nvrtc_major >= 6, "NVRTC versions less than 6 are not supported. 
Is: ", nvrtc_major); @@ -711,6 +915,7 @@ void codegenOutputQuery( // compile to sass is not allowed prior to CUDA 11.1 compile_to_sass = false; #endif +#endif } // TODO: another copy paste from jit, refactor so it's usable from both @@ -764,7 +969,7 @@ constexpr int thread_work_size = THREAD_WORK_SIZE; std::string generate_code( int nInputs, int nOutputs, - const std::string& func, + const std::string& func_, const std::string& name, const std::string& f_inputs_type, const std::string& compute_type, @@ -776,6 +981,7 @@ std::string generate_code( bool vectorized, int vec_size, bool return_by_ref) { + std::string func = func_; at::jit::TemplateEnv env; env.s("index_type", "unsigned int"); @@ -887,11 +1093,16 @@ std::string generate_code( f_inputs_type == "std::complex" || result_type == "std::complex" || f_inputs_type == "std::complex" || result_type == "std::complex") { // complex depends on complex and Half dtypes. - env.s("traits_string", get_traits_string()); + env.s("traits_string", get_traits_string_but_hiprtc_safe()); env.s("complex_body_string", get_complex_body_string()); env.s("complex_math_string", get_complex_math_string()); +#ifdef USE_ROCM + // unhipify math functions, but only if std::complex is used. + func = unhipify_math_functions(func); + env.s("functor", func); +#endif } else if (dynamic_casting) { - env.s("traits_string", get_traits_string()); + env.s("traits_string", get_traits_string_but_hiprtc_safe()); env.s("complex_body_string", get_complex_body_string()); env.s("complex_math_string", ""); } else { @@ -948,7 +1159,8 @@ std::string generate_code( } env.s("store_outputs", store_outputs.str()); - static auto cuda_template = at::jit::CodeTemplate(jit_common_types + offset_calc_template + jit_code_template); + static auto cuda_template = at::jit::CodeTemplate( + jit_preamble + jit_common_types + offset_calc_template + jit_code_template + jit_epilogue); const auto code = cuda_template.format(env); return code; } @@ -1014,7 +1226,8 @@ std::string generate_code( } env.s("store_unrolled_outputs", store_unrolled_outputs.str()); - static auto cuda_template = at::jit::CodeTemplate(jit_common_types + jit_vectorized_code_template); + static auto cuda_template = at::jit::CodeTemplate( + jit_preamble + jit_common_types + jit_vectorized_code_template + jit_epilogue); const auto code = cuda_template.format(env); return code; } @@ -1114,7 +1327,7 @@ std::string generate_reduction_code( std::string generate_reduction_code( int nOutputs, - const std::string& func, + const std::string& func_, const std::string& name, const int vt0, const std::string& f_inputs_type, @@ -1124,6 +1337,7 @@ std::string generate_reduction_code( bool vectorized, int vec_size, int max_threads_codegen) { + std::string func = func_; at::jit::TemplateEnv env; env.s("index_type", "unsigned int"); env.s("scalar_type", f_inputs_type); @@ -1149,10 +1363,14 @@ std::string generate_reduction_code( f_inputs_type == "std::complex" || f_inputs_type == "std::complex" ) { // complex depends on complex and Half dtypes. - env.s("traits_string", get_traits_string()); + env.s("traits_string", get_traits_string_but_hiprtc_safe()); env.s("complex_body_string", get_complex_body_string()); env.s("complex_math_string", get_complex_math_string()); env.s("complex", std::to_string(1)); +#ifdef USE_ROCM + // unhipify math functions, but only if std::complex is used. 
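The call that follows undoes hipify's source-level rewrite of `std::` math calls inside the user-supplied functor string before it is compiled with hiprtc. Below is a hedged, self-contained sketch of that round trip; it is a simplified stand-in for the `replace_all`/`unhipify_math_functions` pair added earlier in this file, with a fewer-entry mapping table and a made-up functor string.

```
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Replace every occurrence of `from` in `s` with `to`.
static void replace_all(std::string& s, const std::string& from, const std::string& to) {
  for (std::size_t pos = s.find(from); pos != std::string::npos;
       pos = s.find(from, pos + to.size())) {
    s.replace(pos, from.size(), to);
  }
}

// Map hipify's " ::max"-style rewrites back to " std::max".
static std::string unhipify_math_functions(std::string s) {
  static const std::vector<std::pair<std::string, std::string>> mappings = {
      {" ::max", " std::max"}, {" ::min", " std::min"},
      {" ::exp", " std::exp"}, {" ::log", " std::log"}, {" ::pow", " std::pow"},
  };
  for (const auto& m : mappings) {
    replace_all(s, m.first, m.second);
  }
  return s;
}

int main() {
  // A functor body as it might look after hipify rewrote the std:: calls.
  std::string func = "T c = ::max(a, b) + ::exp(a);";
  std::cout << unhipify_math_functions(func) << "\n";
  // prints: T c = std::max(a, b) + std::exp(a);
}
```

Matching on the leading space (" ::max") is what the real mapping table relies on to avoid replacing text it has already rewritten.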
+ func = unhipify_math_functions(func); +#endif } else { env.s("traits_string", ""); env.s("complex_body_string", ""); @@ -1168,7 +1386,7 @@ std::string generate_reduction_code( env.s("functor", func); env.s("output_vec_size", std::to_string(vec_size)); static auto cuda_template = at::jit::CodeTemplate( - jit_common_types + offset_calc_template + get_reduction_template()); + jit_preamble + jit_common_types + offset_calc_template + get_reduction_template() + jit_epilogue); const auto code = cuda_template.format(env); return code; } @@ -1312,6 +1530,9 @@ NvrtcFunction jit_pwise_function( AT_CUDA_NVRTC_CHECK(nvrtc.nvrtcCreateProgram( &program, code.c_str(), nullptr, 0, nullptr, nullptr)); +#ifdef USE_ROCM + std::vector args = {"--std=c++14"}; +#else // Constructs nvrtc build arguments // CUDA 11.1 allows going directly to SASS (sm_) instead of PTX (compute_) // which gives better backwards compatibility to work on older driver, @@ -1326,6 +1547,7 @@ NvrtcFunction jit_pwise_function( // NOLINTNEXTLINE(cppcoreguidelines-init-variables) std::vector args = { "--std=c++14", compute.c_str(), "-default-device"}; +#endif #ifndef NDEBUG // Add line info to generated kernels diff --git a/aten/src/ATen/native/cuda/jit_utils.h b/aten/src/ATen/native/cuda/jit_utils.h index 13aa723db2756..8206f67316e11 100644 --- a/aten/src/ATen/native/cuda/jit_utils.h +++ b/aten/src/ATen/native/cuda/jit_utils.h @@ -8,7 +8,6 @@ #include #include #include -#include namespace at { namespace cuda { namespace jit { diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu index 96d700c761ebf..ae09f0aaad8f8 100644 --- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu +++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu @@ -684,7 +684,7 @@ void LayerNormKernelImplInternal( auto can_vectorize = [&](const T * ptr, int alignment){uint64_t addr = reinterpret_cast(ptr); return addr % alignment == 0;}; constexpr int num_vec_elems = vec_size; constexpr int alignment = num_vec_elems * sizeof(T); - if ((std::is_same::value || std::is_same::value) && + if ((std::is_same::value || std::is_same::value || std::is_same::value) && N <= 1ULL << std::numeric_limits::digits && N % num_vec_elems == 0 && can_vectorize(X_data, alignment) && can_vectorize(Y_data, alignment)) { launch_vectorized_layer_norm_kernel(static_cast(N), M, eps, X_data, gamma_data, beta_data, Y_data, mean_data, rstd_data); diff --git a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp index 320c799f23bce..061e7e86de8bb 100644 --- a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp +++ b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp @@ -24,7 +24,6 @@ #include #else #include -#include #include #include #include @@ -115,20 +114,6 @@ void magmaLuNoPivBatched( magma_int_t m, magma_int_t n, scalar_t** dA_array, magma_int_t ldda, magma_int_t* info_array, magma_int_t batchsize, const MAGMAQueue& magma_queue); -template -inline magma_int_t magmaGetriOptimalBlocksize(magma_int_t n); - -template -void magmaGetri( - magma_int_t n, scalar_t* dA, magma_int_t ldda, magma_int_t* ipiv, scalar_t* dwork, - magma_int_t lwork, magma_int_t* info); - -template -void magmaGetriBatched( - magma_int_t n, scalar_t** dA_array, magma_int_t ldda, - magma_int_t** ipiv_array, scalar_t** dinvA_array, magma_int_t lddia, - magma_int_t* info_array, magma_int_t batchsize, const MAGMAQueue& magma_queue); - template void magmaCholeskySolve( magma_uplo_t uplo, magma_int_t n, magma_int_t 
nrhs, scalar_t* dA, magma_int_t ldda, @@ -400,154 +385,6 @@ void magmaLuNoPivBatched>( AT_CUDA_CHECK(cudaGetLastError()); } -template<> -inline magma_int_t magmaGetriOptimalBlocksize(magma_int_t n) { - return magma_get_dgetri_nb(n); -} - -template<> -inline magma_int_t magmaGetriOptimalBlocksize(magma_int_t n) { - return magma_get_sgetri_nb(n); -} - -template <> -inline magma_int_t magmaGetriOptimalBlocksize>( - magma_int_t n) { - return magma_get_zgetri_nb(n); -} - -template <> -inline magma_int_t magmaGetriOptimalBlocksize>( - magma_int_t n) { - return magma_get_cgetri_nb(n); -} - -template<> -void magmaGetri( - magma_int_t n, double* dA, magma_int_t ldda, magma_int_t* ipiv, double* dwork, - magma_int_t lwork, magma_int_t* info) { - MagmaStreamSyncGuard guard; - magma_dgetri_gpu(n, dA, ldda, ipiv, dwork, lwork, info); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template<> -void magmaGetri( - magma_int_t n, float* dA, magma_int_t ldda, magma_int_t* ipiv, float* dwork, - magma_int_t lwork, magma_int_t* info) { - MagmaStreamSyncGuard guard; - magma_sgetri_gpu(n, dA, ldda, ipiv, dwork, lwork, info); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template <> -void magmaGetri>( - magma_int_t n, - c10::complex* dA, - magma_int_t ldda, - magma_int_t* ipiv, - c10::complex* dwork, - magma_int_t lwork, - magma_int_t* info) { - MagmaStreamSyncGuard guard; - magma_zgetri_gpu( - n, - reinterpret_cast(dA), - ldda, - ipiv, - reinterpret_cast(dwork), - lwork, - info); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template <> -void magmaGetri>( - magma_int_t n, - c10::complex* dA, - magma_int_t ldda, - magma_int_t* ipiv, - c10::complex* dwork, - magma_int_t lwork, - magma_int_t* info) { - MagmaStreamSyncGuard guard; - magma_cgetri_gpu( - n, - reinterpret_cast(dA), - ldda, - ipiv, - reinterpret_cast(dwork), - lwork, - info); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template<> -void magmaGetriBatched( - magma_int_t n, double** dA_array, magma_int_t ldda, - magma_int_t** ipiv_array, double** dinvA_array, magma_int_t lddia, - magma_int_t* info_array, magma_int_t batchsize, const MAGMAQueue& magma_queue) { - magma_dgetri_outofplace_batched(n, dA_array, ldda, ipiv_array, dinvA_array, lddia, info_array, batchsize, magma_queue.get_queue()); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template<> -void magmaGetriBatched( - magma_int_t n, float** dA_array, magma_int_t ldda, - magma_int_t** ipiv_array, float** dinvA_array, magma_int_t lddia, - magma_int_t* info_array, magma_int_t batchsize, const MAGMAQueue& magma_queue) { - magma_sgetri_outofplace_batched(n, dA_array, ldda, ipiv_array, dinvA_array, lddia, info_array, batchsize, magma_queue.get_queue()); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template <> -void magmaGetriBatched>( - magma_int_t n, - c10::complex** dA_array, - magma_int_t ldda, - magma_int_t** ipiv_array, - c10::complex** dinvA_array, - magma_int_t lddia, - magma_int_t* info_array, - magma_int_t batchsize, - const MAGMAQueue& magma_queue) { - magma_zgetri_outofplace_batched( - n, - reinterpret_cast(dA_array), - ldda, - ipiv_array, - reinterpret_cast(dinvA_array), - lddia, - info_array, - batchsize, - magma_queue.get_queue()); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template <> -void magmaGetriBatched>( - magma_int_t n, - c10::complex** dA_array, - magma_int_t ldda, - magma_int_t** ipiv_array, - c10::complex** dinvA_array, - magma_int_t lddia, - magma_int_t* info_array, - magma_int_t batchsize, - const MAGMAQueue& magma_queue) { - magma_cgetri_outofplace_batched( - n, - reinterpret_cast(dA_array), - 
ldda, - ipiv_array, - reinterpret_cast(dinvA_array), - lddia, - info_array, - batchsize, - magma_queue.get_queue()); - AT_CUDA_CHECK(cudaGetLastError()); -} - template<> void magmaCholeskySolve( magma_uplo_t uplo, magma_int_t n, magma_int_t nrhs, double* dA, magma_int_t ldda, @@ -1319,156 +1156,6 @@ void ldl_solve_kernel( REGISTER_CUDA_DISPATCH(ldl_factor_stub, &ldl_factor_kernel) REGISTER_CUDA_DISPATCH(ldl_solve_stub, &ldl_solve_kernel) -// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ inverse ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -/* -Computes the inverse of n-by-n matrix 'self', it is saved to 'self_inv'. -'infos' is an int Tensor containing error codes for each matrix in the batched input. -'infos_lu' is for holding magmaLU errors, and 'infos_getri' is for holding magmaGetri errors -For more information see MAGMA's documentation for GETRI and GETRF routines. -*/ -template -static void apply_batched_inverse(Tensor& self, Tensor& self_inv, Tensor& infos_lu, Tensor& infos_getri) { -#if !AT_MAGMA_ENABLED() -AT_ERROR("inverse: MAGMA library not found in " - "compilation. Please rebuild with MAGMA."); -#else - auto self_data = self.data_ptr(); - auto self_mat_stride = matrixStride(self); - auto self_inv_data = self_inv.data_ptr(); - auto self_inv_mat_stride = matrixStride(self_inv); - - auto infos_lu_data = infos_lu.data_ptr(); - auto infos_getri_data = infos_getri.data_ptr(); - - magma_int_t batch_size = magma_int_cast(batchCount(self), "batchCount"); - // MAGMA does not work with batch_size == 0, let's return early in this case - if (batch_size == 0) { - return; - } - - magma_int_t n = magma_int_cast(self.size(-2), "self.size(-2)"); - magma_int_t lda = std::max(1, n); - - magma_int_t* ipiv_data; - magma_int_t** ipiv_array; - scalar_t** self_array; - scalar_t** self_inv_array; - - ALLOCATE_ARRAY(ipiv_data, magma_int_t, batch_size * lda); - ALLOCATE_ARRAY(ipiv_array, magma_int_t*, batch_size); - ALLOCATE_ARRAY(self_array, scalar_t*, batch_size); - ALLOCATE_ARRAY(self_inv_array, scalar_t*, batch_size); - - // Set up the created arrays - for (int64_t i = 0; i < batch_size; i++) { - self_array[i] = &self_data[i * self_mat_stride]; - self_inv_array[i] = &self_inv_data[i * self_inv_mat_stride]; - ipiv_array[i] = &ipiv_data[i * n]; - } - // magmaLuBatched leaves ipiv_data values unwritten for singular matrices. - // Initialize to avoid memory access violations inside magma kernels (gh-51930). 
- std::fill_n(ipiv_data, batch_size * n, 1); - - MAGMAQueue magma_queue(self.get_device()); - magmaLuBatched( - n, n, self_array, lda, ipiv_array, infos_lu_data, - batch_size, magma_queue); - - constexpr int64_t batch_limit = 65535; - // Compute as many batches of 65535 possible - // The number of "mini"-batches are floor(batch_size / batch_limit) - // and these cover floor(batch_size / batch_limit) * batch_limit matrix solves - int64_t mini_batches = batch_size / batch_limit, mini_idx; - for (mini_idx = 0; mini_idx < mini_batches * batch_limit; mini_idx += batch_limit) { - scalar_t** self_array_cur = &self_array[mini_idx]; - scalar_t** self_inv_array_cur = &self_inv_array[mini_idx]; - magma_int_t** ipiv_array_cur = &ipiv_array[mini_idx]; - magma_int_t* info_array_cur_getri = &infos_getri_data[mini_idx]; - - magmaGetriBatched( - n, self_array_cur, lda, ipiv_array_cur, self_inv_array_cur, - lda, info_array_cur_getri, batch_limit, magma_queue); - } - - // Compute whatever is left = batch_size - floor(batch_size / batch_limit) * batch_limit - // which concisely is equal to batch_size % batch_limit - if (batch_size % batch_limit != 0) { - magmaGetriBatched( - n, &self_array[mini_idx], lda, &ipiv_array[mini_idx], &self_inv_array[mini_idx], - lda, &infos_getri_data[mini_idx], batch_size % batch_limit, magma_queue); - } -#endif -} - -template -static void apply_single_inverse(Tensor& self, Tensor& info_lu, Tensor& info_getri) { -#if !AT_MAGMA_ENABLED() -AT_ERROR("inverse: MAGMA library not found in " - "compilation. Please rebuild with MAGMA."); -#else - auto self_data = self.data_ptr(); - magma_int_t n = magma_int_cast(self.size(-2), "self.size(-2)"); - magma_int_t lda = std::max(1, n); - magma_int_t lwork = n * magmaGetriOptimalBlocksize(n); - - // magmaLu and magmaGetri requires info argument to live on CPU - // but info_lu and info_getri tensors are on the same device as self - magma_int_t info_lu_cpu = 0; - magma_int_t info_getri_cpu = 0; - - Tensor ipiv = at::empty({lda}, at::kInt); - Tensor dwork = at::empty({lwork}, self.options()); - magmaLu(n, n, self_data, lda, ipiv.data_ptr(), &info_lu_cpu); - magmaGetri( - n, self_data, lda, ipiv.data_ptr(), dwork.data_ptr(), lwork, &info_getri_cpu); - info_lu.fill_(info_lu_cpu); - info_getri.fill_(info_getri_cpu); -#endif -} - - -// This is a type dispatching helper function for 'apply_batched_inverse' and 'singleCheckErrors' -Tensor& _linalg_inv_out_helper_cuda_legacy(Tensor& result, Tensor& infos_lu, Tensor& infos_getri) { - // assuming result is in column major order and contains the matrices to invert - if (result.dim() > 2) { - auto input_working_copy = cloneBatchedColumnMajor(result); - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cuda", [&]{ - apply_batched_inverse( - input_working_copy, result, infos_lu, infos_getri); - }); - } else { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cuda", [&]{ - apply_single_inverse(result, infos_lu, infos_getri); - }); - } - return result; -} - -// This is a MAGMA/cuSOLVER dispatching helper function -Tensor& _linalg_inv_out_helper_cuda(Tensor &result, Tensor& infos_lu, Tensor& infos_getri) { - // This function calculates the inverse matrix in-place - // result should be in column major order and contain matrices to invert -#ifdef USE_CUSOLVER - auto preferred_backend = at::globalContext().linalgPreferredBackend(); - switch (preferred_backend) { - case at::LinalgBackend::Cusolver: - return _linalg_inv_out_helper_cuda_lib(result, infos_lu, 
infos_getri); // cusolver or cublas - case at::LinalgBackend::Magma: - return _linalg_inv_out_helper_cuda_legacy(result, infos_lu, infos_getri); // magma-cuda - default: - if (batchCount(result) <= 2 || !use_magma_) { - return _linalg_inv_out_helper_cuda_lib(result, infos_lu, infos_getri); // cusolver or cublas - } else { - return _linalg_inv_out_helper_cuda_legacy(result, infos_lu, infos_getri); // magma-cuda - } - } -#else - return _linalg_inv_out_helper_cuda_legacy(result, infos_lu, infos_getri); // magma-cuda -#endif - return result; -} - // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cholesky_solve ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ template @@ -1928,7 +1615,12 @@ static void lu_factor(const Tensor& input, const Tensor& pivots, const Tensor& i const auto preferred_backend = at::globalContext().linalgPreferredBackend(); #ifdef USE_CUSOLVER const auto lu_factor_cusolver = [batch_size, m, n](const Tensor& input, const Tensor& pivots, const Tensor& infos, bool compute_pivots) { - if (batch_size == 1 || m != n || m >= 512) { + // In CUDA 10.2, lu_factor_looped_cusolver does not finish the computations when the input + // matrix is exactly singular. The returned pivots contain garbage. This breaks linalg.det + // Now, batched_cublas does not handle rectangular matrices, so we still dispatch to + // looped_cusolver even if m != n. + constexpr bool looped_correct = CUSOLVER_VERSION >= 11100; + if (m != n || (looped_correct && (batch_size == 1 || m >= 512))) { lu_factor_looped_cusolver(input, pivots, infos, compute_pivots); } else { lu_factor_batched_cublas(input, pivots, infos, compute_pivots); @@ -3254,8 +2946,7 @@ struct DispatchInitializer { DispatchInitializer() { cuda::detail::LinalgDispatch disp{ _symeig_helper_cuda, _cholesky_solve_helper_cuda, - legacy_lstsq_cuda, - _linalg_inv_out_helper_cuda}; + legacy_lstsq_cuda}; cuda::detail::registerLinalgDispatch(disp); }; } initializer; diff --git a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp index d3109d866a592..d80b93b3da098 100644 --- a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp +++ b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp @@ -471,101 +471,6 @@ inline static Tensor column_major_identity_matrix_like(const Tensor& self) { return at::ones(size_slice, self.options()).diag_embed().mT(); } -template -inline static void _apply_single_inverse_helper(scalar_t* self_ptr, scalar_t* self_inv_ptr, int* ipiv_ptr, int* info_getrf_ptr, int* info_getrs_ptr, int n, int lda) { - // self_inv_ptr should already be an identity matrix - - auto handle = at::cuda::getCurrentCUDASolverDnHandle(); - at::cuda::solver::getrf(handle, n, n, self_ptr, lda, ipiv_ptr, info_getrf_ptr); - at::cuda::solver::getrs(handle, n, n, self_ptr, lda, ipiv_ptr, self_inv_ptr, lda, info_getrs_ptr, CUBLAS_OP_N); -} - -template -static void apply_batched_inverse_lib(Tensor& self, Tensor& self_inv, Tensor& infos_getrf, Tensor& infos_getrs) { - const int batch_size = cuda_int_cast(batchCount(self), "batchCount"); - const int n = cuda_int_cast(self.size(-2), "self.size(-2)"); - const int lda = std::max(1, n); - - auto self_data = self.data_ptr(); - auto self_mat_stride = matrixStride(self); - auto self_inv_data = self_inv.data_ptr(); - auto self_inv_mat_stride = matrixStride(self_inv); - - auto infos_getrf_data = infos_getrf.data_ptr(); - auto infos_getrs_data = infos_getrs.data_ptr(); - - auto& allocator = *::c10::cuda::CUDACachingAllocator::get(); - - // Heuristic: For small batch size or large 
matrix size, we use for-loop to iterate over the batches instead of - // calling the batched cublas routine. - if (batch_size <= 8 || /* batch_size > 8 && */ n >= 512) { - for (int64_t i = 0; i < batch_size; i++) { - auto dataPtr = allocator.allocate(sizeof(int) * lda); - int* pivot = reinterpret_cast(dataPtr.get()); - - int* infos_getrf_working_ptr = &infos_getrf_data[i]; - int* infos_getrs_working_ptr = &infos_getrs_data[i]; - - _apply_single_inverse_helper( - &self_data[i * self_mat_stride], &self_inv_data[i * self_inv_mat_stride], pivot, infos_getrf_working_ptr, infos_getrs_working_ptr, n, lda); - } - } else { - // cublas batched kernels require input be "device array of device pointers" - Tensor self_array = at::arange( - reinterpret_cast(self_data), - reinterpret_cast(&self_data[(batch_size-1) * self_mat_stride]) + 1, - static_cast(self_mat_stride * sizeof(scalar_t)), self.options().dtype(at::kLong)); - Tensor self_inv_array = at::arange( - reinterpret_cast(self_inv_data), - reinterpret_cast(&self_inv_data[(batch_size-1) * self_inv_mat_stride]) + 1, - static_cast(self_inv_mat_stride * sizeof(scalar_t)), self.options().dtype(at::kLong)); - - auto dataPtr = allocator.allocate(sizeof(int)*batch_size*lda); - int* ipiv_array = reinterpret_cast(dataPtr.get()); - - at::cuda::blas::getrfBatched(n, reinterpret_cast(self_array.data_ptr()), lda, - ipiv_array, infos_getrf_data, batch_size); - - at::cuda::blas::getriBatched(n, reinterpret_cast(self_array.data_ptr()), lda, - ipiv_array, reinterpret_cast(self_inv_array.data_ptr()), lda, infos_getrs_data, batch_size); - } -} - -template -static void apply_single_inverse_lib(const Tensor& self, Tensor& self_inv, Tensor& infos_getrf, Tensor& infos_getrs) { - int n = cuda_int_cast(self.size(-2), "self.size(-2)"); - int lda = std::max(1, n); - - Tensor ipiv = at::empty({lda}, self.options().dtype(at::kInt)); - - _apply_single_inverse_helper( - self.data_ptr(), self_inv.data_ptr(), ipiv.data_ptr(), infos_getrf.data_ptr(), infos_getrs.data_ptr(), n, lda); -} - -// This is a type dispatching helper function for 'apply_batched_inverse_lib' and 'apply_single_inverse_lib' -Tensor& _linalg_inv_out_helper_cuda_lib(Tensor& result, Tensor& infos_getrf, Tensor& infos_getrs) { - // assuming result is in column major order and contains the matrices to invert - Tensor input_working_copy = cloneBatchedColumnMajor(result); - - // for getrf + getrs (cusolver path) - // result should be filled with identity matrices - result.zero_(); - result.diagonal(/*offset=*/0, /*dim1=*/-2, /*dim2=*/-1).fill_(1); - - if (result.dim() > 2) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cuda", [&]{ - apply_batched_inverse_lib( - input_working_copy, result, infos_getrf, infos_getrs); - }); - } else { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cuda", [&]{ - apply_single_inverse_lib(input_working_copy, result, infos_getrf, infos_getrs); - }); - } - - return result; -} - // call cusolver gesvd function to calculate svd template inline static void apply_svd_cusolver_gesvd(const Tensor& A, const Tensor& U, const Tensor& S, const Tensor& V, diff --git a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.h b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.h index adee8cc9eb4ea..65b4f9b577178 100644 --- a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.h +++ b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.h @@ -59,10 +59,6 @@ void lu_solve_batched_cublas(const Tensor& LU, const Tensor& pivots, const 
Tenso #ifdef USE_CUSOLVER -// entrance of calculations of `inverse` using cusolver getrf + getrs, cublas getrfBatched + getriBatched -Tensor _inverse_helper_cuda_lib(const Tensor& self); -Tensor& _linalg_inv_out_helper_cuda_lib(Tensor& result, Tensor& infos_getrf, Tensor& infos_getrs); - // entrance of calculations of `svd` using cusolver gesvdj and gesvdjBatched void svd_cusolver(const Tensor& A, const bool full_matrices, const bool compute_uv, const c10::optional& driver, const Tensor& U, const Tensor& S, const Tensor& V, const Tensor& info); @@ -91,7 +87,6 @@ struct LinalgDispatch { std::tuple (*symeig_helper)(const Tensor& self, bool eigenvectors, bool upper); Tensor (*cholesky_solve_helper)(const Tensor& self, const Tensor& A, bool upper); std::tuple (*legacy_lstsq)(const Tensor &B, const Tensor &A); - Tensor& (*inv_out_helper)(Tensor &result, Tensor& infos_lu, Tensor& infos_getri); }; C10_EXPORT void registerLinalgDispatch(const LinalgDispatch&); }} // namespace cuda::detail diff --git a/aten/src/ATen/native/cuda/reduction_template.cuh b/aten/src/ATen/native/cuda/reduction_template.cuh index 4d9d559d8ec8a..a38edb538256d 100644 --- a/aten/src/ATen/native/cuda/reduction_template.cuh +++ b/aten/src/ATen/native/cuda/reduction_template.cuh @@ -4,11 +4,22 @@ namespace cuda { const std::string reduction_template_0 = R"ESCAPE( #define C10_HOST_DEVICE __host__ __device__ #define C10_DEVICE __device__ + #if defined(__clang__) && defined(__HIP__) + #ifndef __forceinline__ + #define __forceinline__ inline __attribute__((always_inline)) + #endif + // until ROCm support for kernel asserts is restored + #define assert(expr) (static_cast(0)) + #endif template __device__ __forceinline__ T WARP_SHFL_DOWN(T value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff) { + #if defined(__clang__) && defined(__HIP__) + return __shfl_down(value, delta, width); + #else return __shfl_down_sync(mask, value, delta, width); + #endif } @@ -17,8 +28,13 @@ const std::string reduction_template_0 = R"ESCAPE( __device__ __forceinline__ std::complex WARP_SHFL_DOWN(std::complex value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff) { return std::complex( + #if defined(__clang__) && defined(__HIP__) + __shfl_down(value.real(), delta, width), + __shfl_down(value.imag(), delta, width)); + #else __shfl_down_sync(mask, value.real(), delta, width), __shfl_down_sync(mask, value.imag(), delta, width)); + #endif } #endif diff --git a/aten/src/ATen/native/cudnn/Conv_v7.cpp b/aten/src/ATen/native/cudnn/Conv_v7.cpp index 5225fff3bc234..63968fd2072f9 100644 --- a/aten/src/ATen/native/cudnn/Conv_v7.cpp +++ b/aten/src/ATen/native/cudnn/Conv_v7.cpp @@ -131,7 +131,7 @@ struct Workspace { // Sometimes cuDNN returns a workspace size > 2^63, this could makes the allocation of // workspace fail with some 64bit indexing error instead of an OOM error. In such case, // we manually fail with OOM. 
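Context for the exception-type rename in the hunks below: the surrounding cuDNN workspace code guards absurd size requests up front and otherwise shrinks and retries on OOM. A minimal, self-contained sketch of that pattern is shown here, assuming stand-in names — the allocator, the size cap, and the `OutOfMemoryError` type are illustrative only, not the real c10/CUDA APIs:

```
// Sketch only: mimics the "guard huge workspace sizes, then halve-and-retry
// on OOM" pattern; all names here are stand-ins for illustration.
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <stdexcept>

struct OutOfMemoryError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical raw allocator: pretend only 1 MiB is available.
void* allocate_raw(std::size_t size) {
  constexpr std::size_t cap = std::size_t{1} << 20;
  if (size > cap) throw OutOfMemoryError("allocation failed");
  return std::malloc(size);
}

void* allocate_workspace_with_retry(std::size_t size) {
  // Up-front guard, analogous to the `size < 1_TiB` check in the diff:
  // fail with an OOM-style error instead of letting a bogus huge size
  // surface later as a confusing 64-bit indexing failure.
  constexpr std::size_t one_tib = std::size_t{1} << 40;
  if (size >= one_tib) throw OutOfMemoryError("Not enough memory for workspace!");

  while (size > 0) {
    try {
      return allocate_raw(size);
    } catch (const OutOfMemoryError&) {
      size /= 2;  // shrink and retry, like the workspace halving loop in Conv_v8.cpp
    }
  }
  throw OutOfMemoryError("workspace allocation failed");
}

int main() {
  void* ws = allocate_workspace_with_retry(std::size_t{8} << 20);  // ask for 8 MiB
  std::cout << "got workspace at " << ws << "\n";
  std::free(ws);
}
```

In the patch itself the same idea appears as `TORCH_CHECK_WITH(OutOfMemoryError, size < 1_TiB, ...)` plus `catch (c10::OutOfMemoryError&)` blocks that halve `max_workspace_size`; the hunks below only rename the caught exception type.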
- TORCH_CHECK_WITH(CUDAOutOfMemoryError, size < 1_TiB, "Not enough memory for workspace!"); + TORCH_CHECK_WITH(OutOfMemoryError, size < 1_TiB, "Not enough memory for workspace!"); data = c10::cuda::CUDACachingAllocator::raw_alloc(size); } Workspace(const Workspace&) = delete; @@ -505,7 +505,7 @@ class AlgoIterator { try { f(algoPerf); return; - } catch (c10::CUDAOutOfMemoryError &e) { + } catch (c10::OutOfMemoryError &e) { cudaGetLastError(); // clear CUDA error } } @@ -516,7 +516,7 @@ class AlgoIterator { f(algoPerf); cache.insert(args.params, algoPerf); return; - } catch (c10::CUDAOutOfMemoryError &e) { + } catch (c10::OutOfMemoryError &e) { cudaGetLastError(); // clear CUDA error } catch (c10::CuDNNError &e) { cudaGetLastError(); // clear CUDA error @@ -530,7 +530,7 @@ inline Tensor allocate_workspace(size_t size, const Tensor &other) { // Sometimes cuDNN returns a workspace size > 2^63, this could makes the allocation of // workspace fail with some 64bit indexing error instead of an OOM error. In such case, // we manually fail with OOM. - TORCH_CHECK_WITH(CUDAOutOfMemoryError, size < 1_TiB, "Not enough memory for workspace!"); + TORCH_CHECK_WITH(OutOfMemoryError, size < 1_TiB, "Not enough memory for workspace!"); return at::empty({static_cast(size)}, other.options().dtype(kByte)); } diff --git a/aten/src/ATen/native/cudnn/Conv_v8.cpp b/aten/src/ATen/native/cudnn/Conv_v8.cpp index 7d5664b12cf51..2ad8d4ffe37c1 100644 --- a/aten/src/ATen/native/cudnn/Conv_v8.cpp +++ b/aten/src/ATen/native/cudnn/Conv_v8.cpp @@ -332,13 +332,13 @@ void generate_and_filter_plans(const cudnnHandle_t handle, cudnn_frontend::Opera valid_plans.emplace_back(std::move(plan)); } }); - TORCH_CHECK_WITH(CUDAOutOfMemoryError, max_workspace_size < 1_TiB, "Not enough memory for workspace!"); + TORCH_CHECK_WITH(OutOfMemoryError, max_workspace_size < 1_TiB, "Not enough memory for workspace!"); bool remove_invalid = false; while (max_workspace_size) { try { workspace_ptr = c10::cuda::CUDACachingAllocator::get()->allocate(max_workspace_size); break; - } catch (c10::CUDAOutOfMemoryError &e) { + } catch (c10::OutOfMemoryError &e) { max_workspace_size /= 2; cudaGetLastError(); // clear CUDA error remove_invalid = true; @@ -449,7 +449,7 @@ void try_plans(cudnn_frontend::executionPlans_t& plans, const CacheKey& key, con benchmark_cache.emplace(key, plan); return; } catch (cudnn_frontend::cudnnException &e) {} catch (CuDNNError &e) {} - catch (c10::CUDAOutOfMemoryError &e) { + catch (c10::OutOfMemoryError &e) { cudaGetLastError(); // clear CUDA error } } @@ -463,7 +463,7 @@ void try_plans_fused(cudnn_frontend::executionPlans_t& plans, const CacheKeyFuse benchmark_cache_fused.emplace(key, plan); return; } catch (cudnn_frontend::cudnnException &e) {} catch (CuDNNError &e) {} - catch (c10::CUDAOutOfMemoryError &e) { + catch (c10::OutOfMemoryError &e) { cudaGetLastError(); // clear CUDA error } } @@ -484,7 +484,7 @@ void try_configs(cudnn_frontend::EngineConfigList& configs, const std::string& o benchmark_cache.emplace(key, plan); return; } catch (cudnn_frontend::cudnnException &e) {} catch(CuDNNError &e) {} - catch (c10::CUDAOutOfMemoryError &e) { + catch (c10::OutOfMemoryError &e) { cudaGetLastError(); // clear CUDA error } } @@ -505,7 +505,7 @@ void try_configs_fused(cudnn_frontend::EngineConfigList& configs, const std::str benchmark_cache_fused.emplace(key, plan); return; } catch (cudnn_frontend::cudnnException &e) {} catch(CuDNNError &e) {} - catch (c10::CUDAOutOfMemoryError &e) { + catch (c10::OutOfMemoryError &e) { 
cudaGetLastError(); // clear CUDA error } } @@ -525,7 +525,7 @@ void run_single_conv(const cudnnBackendDescriptorType_t operation, try { run_conv_plan(handle, x, y, w, *search); return; - } catch(c10::CUDAOutOfMemoryError &e) { + } catch(c10::OutOfMemoryError &e) { cudaGetLastError(); // clear CUDA error } } @@ -561,7 +561,7 @@ void run_fused_conv(const Tensor& x, const Tensor& y, const Tensor& w, const Ten try { run_conv_plan_fused(handle, x, y, w, z, b, *search); return; - } catch(c10::CUDAOutOfMemoryError &e) { + } catch(c10::OutOfMemoryError &e) { cudaGetLastError(); // clear CUDA error } } diff --git a/aten/src/ATen/native/layer_norm.cpp b/aten/src/ATen/native/layer_norm.cpp index f9ef7e097e017..0b0245896dfa4 100644 --- a/aten/src/ATen/native/layer_norm.cpp +++ b/aten/src/ATen/native/layer_norm.cpp @@ -206,7 +206,17 @@ std::tuple math_native_layer_norm( const int normalized_ndim = normalized_shape.size(); // NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions) const int axis = input_ndim - normalized_ndim; - at::Tensor input_reshaped = input.view({1, M, -1}); + + // Properly handle zero-size inputs: the view(1, M, -1) call below breaks on this. + if (input.numel() == 0) { + auto result_type = c10::promoteTypes(input.scalar_type(), kFloat); + return std::make_tuple( + at::empty_like(input), + at::empty_like(input, c10::TensorOptions().dtype(result_type)), + at::empty_like(input, c10::TensorOptions().dtype(result_type)) + ); + } + at::Tensor input_reshaped = input.reshape({1, M, -1}); // Unlike Batch Normalization, which applies scalar scale and bias for each // entire channel/plane with the affine option, Layer Normalization applies // per-element scale and bias. E.g. For input {N, C, H, W}, weight for diff --git a/aten/src/ATen/native/miopen/Conv_miopen.cpp b/aten/src/ATen/native/miopen/Conv_miopen.cpp index 61eb209d5adc1..be92f5a311a55 100644 --- a/aten/src/ATen/native/miopen/Conv_miopen.cpp +++ b/aten/src/ATen/native/miopen/Conv_miopen.cpp @@ -102,6 +102,20 @@ std::tuple miopen_depthwise_convolution_backwa AT_ERROR("miopen_depthwise_convolution_backward: ATen not compiled with MIOpen support"); } + +at::Tensor miopen_convolution_add_relu( + const at::Tensor& input, const at::Tensor& weight, const at::Tensor& z, + const c10::optional& alpha, const c10::optional& bias, IntArrayRef stride, + IntArrayRef padding, IntArrayRef dilation, int64_t groups) { + AT_ERROR("miopen_convolution_add_relu: ATen not compiled with MIOpen support"); +} + +at::Tensor miopen_convolution_relu( + const at::Tensor& input, const at::Tensor& weight, const c10::optional& bias, + IntArrayRef stride, IntArrayRef padding, IntArrayRef dilation, int64_t groups) { + AT_ERROR("miopen_convolution_relu: ATen not compiled with MIOpen support"); +} + }} #else // AT_ROCM_ENABLED @@ -1449,6 +1463,219 @@ Tensor miopen_convolution_transpose( return output_t; } +// MIOpen fused convolution bias activation forward +void raw_miopen_convolution_relu_out( + const Tensor& output, + const Tensor& input, + const Tensor& weight, + const Tensor& bias, + IntArrayRef stride, + IntArrayRef padding, + IntArrayRef dilation, + int64_t groups, + bool benchmark, + bool deterministic) { + + auto dataType = getMiopenDataType(input); + miopenConvolutionMode_t c_mode = miopenConvolution; + + ConvolutionArgs args{ input, output, weight }; + args.handle = getMiopenHandle(); + setConvolutionParams(&args.params, args.handle, input, weight, padding, stride, dilation, groups, deterministic); + args.idesc.set(input); + 
args.wdesc.set(weight, input.suggest_memory_format(), 0); + args.odesc.set(output); + args.cdesc.set(dataType, c_mode, input.dim() - 2, args.params.padding, args.params.stride, args.params.dilation, args.params.groups); + + TensorDescriptor bdesc; + bdesc.set(bias.expand({1, bias.size(0)}), output.dim()); + + // Create the fusion plan + miopenFusionPlanDescriptor_t fusePlanDesc; + miopenFusionOpDescriptor_t convoOp; + miopenFusionOpDescriptor_t biasOp; + miopenFusionOpDescriptor_t activOp; + MIOPEN_CHECK(miopenCreateFusionPlan(&fusePlanDesc, miopenVerticalFusion, args.idesc.desc())); + MIOPEN_CHECK(miopenCreateOpConvForward(fusePlanDesc, &convoOp, args.cdesc.desc(), args.wdesc.desc())); + MIOPEN_CHECK(miopenCreateOpBiasForward(fusePlanDesc, &biasOp, bdesc.desc())); + MIOPEN_CHECK(miopenCreateOpActivationForward(fusePlanDesc, &activOp, miopenActivationRELU)); + + // compile fusion plan + MIOPEN_CHECK(miopenCompileFusionPlan(args.handle, fusePlanDesc)); + + // Set the Args + float alpha = static_cast(1); + float beta = static_cast(0); + float activ_alpha = static_cast(0); + float activ_beta = static_cast(0); + float activ_gamma = static_cast(0); + miopenOperatorArgs_t fusionArgs; + MIOPEN_CHECK(miopenCreateOperatorArgs(&fusionArgs)); + MIOPEN_CHECK(miopenSetOpArgsConvForward(fusionArgs, convoOp, &alpha, &beta, weight.data_ptr())); + MIOPEN_CHECK(miopenSetOpArgsBiasForward(fusionArgs, biasOp, &alpha, &beta, bias.data_ptr())); + MIOPEN_CHECK(miopenSetOpArgsActivForward(fusionArgs, activOp, &alpha, &beta, activ_alpha, activ_beta, activ_gamma)); + + miopenExecuteFusionPlan(args.handle, fusePlanDesc, args.idesc.desc(), input.data_ptr(), args.odesc.desc(), output.data_ptr(), fusionArgs); + + // Cleanup + miopenDestroyFusionPlan(fusePlanDesc); +} + +static at::Tensor self_or_new_memory_format(at::Tensor& self, at::MemoryFormat memory_format) { + if (self.is_contiguous(memory_format)) { + return self; + } + return at::empty_like(self, self.options(), memory_format); +} + +Tensor miopen_convolution_add_relu( + const Tensor& input, + const Tensor& weight, + const Tensor& z, + const c10::optional& alpha, + const c10::optional& bias, + IntArrayRef stride, + IntArrayRef padding, + IntArrayRef dilation, + int64_t groups) { + + // MIOpen does not support fusion of add, the alpha2 * z step of the below cuDNN function: + // y = act ( alpha1 * conv(x) + alpha2 * z + bias ) + + auto memory_format = input.suggest_memory_format(); + + auto& ctx = at::globalContext(); + bool benchmark = ctx.benchmarkCuDNN(); + + TensorArg input_arg { input, "input", 1 }, + weight_arg { weight, "weight", 2 }; + auto output = miopen_convolution_forward( + "miopen_convolution_add_relu", + input_arg, + weight_arg, + padding, + stride, + dilation, + groups, + benchmark, + false // deterministic + ); + + auto contig_output = self_or_new_memory_format(output, memory_format); + + if (!output.is_same(contig_output)) { + contig_output.copy_(output); + } + + auto _alpha = alpha.has_value() ? alpha.value().to() : 1.0; + auto _bias = bias.has_value() + ? 
bias.value() + : at::native::zeros( + {contig_output.size(1)}, + optTypeMetaToScalarType(contig_output.options().dtype_opt()), + contig_output.options().layout_opt(), + contig_output.options().device_opt(), + contig_output.options().pinned_memory_opt()); + + at::Tensor alpha_mul_z_add_bias = at::native::reshape_bias(input.dim(), _bias).add(z, _alpha); + contig_output.add_(alpha_mul_z_add_bias); + contig_output.relu_(); + + return contig_output; +} + +Tensor miopen_convolution_relu( + const Tensor& input, + const Tensor& weight, + const c10::optional& bias, + IntArrayRef stride, + IntArrayRef padding, + IntArrayRef dilation, + int64_t groups) { + + auto memory_format = input.suggest_memory_format(); + + auto& ctx = at::globalContext(); + bool benchmark = ctx.benchmarkCuDNN(); + + // MIOpen currently only supports MemoryFormat::Contiguous and fp32 and 2d + if (input.suggest_memory_format() == at::MemoryFormat::Contiguous + && input.scalar_type() == at::kFloat + && input.ndimension() == 4) { + + // FuseFrozenConvAddRelu performs some tensor shape checking + Tensor output_t = at::detail::empty_cuda( + conv_output_size( + input.sizes(), weight.sizes(), padding, stride, dilation), + input.options().memory_format(input.suggest_memory_format())); + if (output_t.numel() == 0) { + return output_t; + } + + auto _bias = bias.has_value() + ? bias.value() + : at::native::zeros( + {output_t.size(1)}, + optTypeMetaToScalarType(output_t.options().dtype_opt()), + output_t.options().layout_opt(), + output_t.options().device_opt(), + output_t.options().pinned_memory_opt()); + + raw_miopen_convolution_relu_out( + output_t, + input, + weight, + _bias, + stride, + padding, + dilation, + groups, + benchmark, // benchmark + false // deterministic + ); + + return output_t; + } + else { + // fallback + + TensorArg input_arg { input, "input", 1 }, + weight_arg { weight, "weight", 2 }; + auto output = miopen_convolution_forward( + "miopen_convolution_relu", + input_arg, + weight_arg, + padding, + stride, + dilation, + groups, + benchmark, + false // deterministic + ); + + auto contig_output = self_or_new_memory_format(output, memory_format); + + if (!output.is_same(contig_output)) { + contig_output.copy_(output); + } + + auto _bias = bias.has_value() + ? 
bias.value() + : at::native::zeros( + {contig_output.size(1)}, + optTypeMetaToScalarType(contig_output.options().dtype_opt()), + contig_output.options().layout_opt(), + contig_output.options().device_opt(), + contig_output.options().pinned_memory_opt()); + + at::Tensor reshaped_bias = at::native::reshape_bias(input.dim(), _bias); + contig_output.add_(reshaped_bias); + contig_output.relu_(); + + return contig_output; + } +} + REGISTER_CUDA_DISPATCH(miopen_convolution_backward_stub, &miopen_convolution_backward); REGISTER_CUDA_DISPATCH(miopen_convolution_transpose_backward_stub, &miopen_convolution_transpose_backward); REGISTER_CUDA_DISPATCH(miopen_depthwise_convolution_backward_stub, &miopen_depthwise_convolution_backward); diff --git a/aten/src/ATen/native/mkldnn/Common.h b/aten/src/ATen/native/mkldnn/Common.h new file mode 100644 index 0000000000000..4e048ebce7597 --- /dev/null +++ b/aten/src/ATen/native/mkldnn/Common.h @@ -0,0 +1,46 @@ +#pragma once + +#include +#include + +#if AT_MKLDNN_ENABLED() + +#include + +namespace at { +namespace native { +namespace mkldnn { + +struct ContextConv final { + ideep::tensor weight_packed_; + c10::optional at_bias_; + std::vector padding_; + std::vector stride_; + std::vector dilation_; + int64_t groups_; + ideep::attr_t attr_; + + ContextConv() = delete; + + ContextConv( + ideep::tensor&& weight_packed, + c10::optional at_bias, + std::vector padding, + std::vector stride, + std::vector dilation, + int64_t groups, + ideep::attr_t attr) + : weight_packed_(std::move(weight_packed)), + at_bias_(std::move(at_bias)), + padding_(padding), + stride_(stride), + dilation_(dilation), + groups_(groups), + attr_(attr) {} +}; + +} // namespace mkldnn +} // namespace native +} // namespace at + +#endif // AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/Conv.cpp b/aten/src/ATen/native/mkldnn/Conv.cpp index 0096a1cda6743..7f285e2cfcb8b 100644 --- a/aten/src/ATen/native/mkldnn/Conv.cpp +++ b/aten/src/ATen/native/mkldnn/Conv.cpp @@ -155,9 +155,17 @@ static void check_shape_forward(const Tensor& input, // but weight/bias and grad_weight/grad_bias are always CPU tensor. // +static inline at::MemoryFormat mkldnn_convolution_memory_format(int64_t dims, bool is_channels_last) { + auto memory_format = at::MemoryFormat::Contiguous; + if (is_channels_last) { + memory_format = dims == 4 ? at::MemoryFormat::ChannelsLast : at::MemoryFormat::ChannelsLast3d; + } + return memory_format; +} + Tensor mkldnn_convolution( - const Tensor& input, - const Tensor& weight, + const Tensor& input_t, + const Tensor& weight_t, const c10::optional& bias_opt, IntArrayRef padding, IntArrayRef stride, @@ -167,15 +175,18 @@ Tensor mkldnn_convolution( c10::MaybeOwned bias_maybe_owned = at::borrow_from_optional_tensor(bias_opt); const Tensor& bias = *bias_maybe_owned; - if (input.scalar_type() == ScalarType::BFloat16) { + if (input_t.scalar_type() == ScalarType::BFloat16) { TORCH_CHECK(mkldnn_bf16_device_check(), "mkldnn_convolution: bf16 path needs the cpu support avx512bw, avx512vl and avx512dq"); } - check_shape_forward(input, weight, bias, padding, stride, dilation, groups); + check_shape_forward(input_t, weight_t, bias, padding, stride, dilation, groups); - bool is_channels_last = input.suggest_memory_format() == at::MemoryFormat::ChannelsLast; + bool is_channels_last = mkldnn_conv_use_channels_last(input_t, weight_t); + auto memory_format = mkldnn_convolution_memory_format(input_t.ndimension(), is_channels_last); + auto input = input_t.is_mkldnn() ? 
input_t : input_t.contiguous(memory_format); + auto weight = weight_t.is_mkldnn() ? weight_t : weight_t.contiguous(memory_format); auto output_sizes = conv_output_size(input.sizes(), weight.sizes(), padding, stride, dilation); auto output = at::empty({0}, input.options()); @@ -184,12 +195,12 @@ Tensor mkldnn_convolution( ideep::tensor y; if (is_channels_last) { - output.resize_(output_sizes, input.suggest_memory_format()); + output.resize_(output_sizes, memory_format); y = itensor_from_tensor(output); } if (bias.defined()) { const ideep::tensor b = itensor_from_tensor(bias); - ideep::convolution_forward::compute( + ideep::convolution_forward::compute_v3( x, w, b, @@ -199,9 +210,10 @@ Tensor mkldnn_convolution( {dilation.begin(), dilation.end()}, {padding.begin(), padding.end()}, {padding.begin(), padding.end()}, - groups); + groups, + is_channels_last); } else { - ideep::convolution_forward::compute( + ideep::convolution_forward::compute_v3( x, w, {output_sizes.cbegin(), output_sizes.cend()}, @@ -210,7 +222,8 @@ Tensor mkldnn_convolution( {dilation.begin(), dilation.end()}, {padding.begin(), padding.end()}, {padding.begin(), padding.end()}, - groups); + groups, + is_channels_last); } if (input.is_mkldnn()) { @@ -224,10 +237,15 @@ Tensor mkldnn_convolution( } Tensor mkldnn_convolution_backward_input( - IntArrayRef input_size, const Tensor& grad_output, const Tensor& weight, - IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups, bool bias_defined) -{ - bool is_channels_last = grad_output.suggest_memory_format() == at::MemoryFormat::ChannelsLast; + IntArrayRef input_size, + const Tensor& grad_output, + const Tensor& weight, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + int64_t groups, + bool bias_defined, + bool is_channels_last) { auto grad_input = at::empty({0}, grad_output.options()); auto grad_y = itensor_from_tensor(grad_output); @@ -235,10 +253,11 @@ Tensor mkldnn_convolution_backward_input( ideep::tensor grad_x; if (is_channels_last) { - grad_input.resize_(input_size, grad_output.suggest_memory_format()); + auto memory_format = mkldnn_convolution_memory_format(grad_output.ndimension(), is_channels_last); + grad_input.resize_(input_size, memory_format); grad_x = itensor_from_tensor(grad_input); } - ideep::convolution_backward_data::compute( + ideep::convolution_backward_data::compute_v2( grad_y, w, input_size.vec(), @@ -247,7 +266,8 @@ Tensor mkldnn_convolution_backward_input( dilation.vec(), padding.vec(), padding.vec(), - groups); + groups, + is_channels_last); if (grad_output.is_mkldnn()) { return MKLDNNTensor(grad_x, grad_output.options()); @@ -260,17 +280,21 @@ Tensor mkldnn_convolution_backward_input( } std::tuple mkldnn_convolution_backward_weights( - IntArrayRef weight_size, const Tensor& grad_output, const Tensor& input, - IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups, bool bias_defined) -{ - bool is_channels_last = grad_output.suggest_memory_format() == at::MemoryFormat::ChannelsLast; - + IntArrayRef weight_size, + const Tensor& grad_output, + const Tensor& input, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + int64_t groups, + bool bias_defined, + bool is_channels_last) { const ideep::tensor grad_y = itensor_from_tensor(grad_output); const ideep::tensor x = itensor_from_tensor(input); ideep::tensor grad_w, grad_b; if (bias_defined) { - ideep::convolution_backward_weights::compute( + ideep::convolution_backward_weights::compute_v2( x, grad_y, weight_size.vec(), @@ -280,9 
+304,10 @@ std::tuple mkldnn_convolution_backward_weights( dilation.vec(), padding.vec(), padding.vec(), - groups); + groups, + is_channels_last); } else { - ideep::convolution_backward_weights::compute( + ideep::convolution_backward_weights::compute_v2( x, grad_y, weight_size.vec(), @@ -291,7 +316,8 @@ std::tuple mkldnn_convolution_backward_weights( dilation.vec(), padding.vec(), padding.vec(), - groups); + groups, + is_channels_last); } if (!is_channels_last) { @@ -306,20 +332,23 @@ std::tuple mkldnn_convolution_backward_weights( } std::tuple mkldnn_convolution_backward( - const Tensor& input, const Tensor& grad_output_t, const Tensor& weight, + const Tensor& input_t, const Tensor& grad_output_t, const Tensor& weight_t, IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups, std::array output_mask) { - auto memory_format = input.suggest_memory_format(); + bool is_channels_last = mkldnn_conv_use_channels_last(input_t, weight_t); + auto memory_format = mkldnn_convolution_memory_format(input_t.ndimension(), is_channels_last); Tensor grad_output = grad_output_t.is_mkldnn() ? grad_output_t : grad_output_t.contiguous(memory_format); + Tensor input = input_t.is_mkldnn() ? input_t : input_t.contiguous(memory_format); + Tensor weight = weight_t.is_mkldnn() ? weight_t : weight_t.contiguous(memory_format); Tensor grad_input, grad_weight, grad_bias; if (output_mask[0]) { grad_input = mkldnn_convolution_backward_input( - input.sizes(), grad_output, weight, padding, stride, dilation, groups, output_mask[2]); + input.sizes(), grad_output, weight, padding, stride, dilation, groups, output_mask[2], is_channels_last); } if (output_mask[1] || output_mask[2]) { std::tie(grad_weight, grad_bias) = mkldnn_convolution_backward_weights( - weight.sizes(), grad_output, input, padding, stride, dilation, groups, output_mask[2]); + weight.sizes(), grad_output, input, padding, stride, dilation, groups, output_mask[2], is_channels_last); } return std::make_tuple(grad_input, grad_weight, grad_bias); diff --git a/aten/src/ATen/native/mkldnn/ConvPrepack.cpp b/aten/src/ATen/native/mkldnn/ConvPrepack.cpp new file mode 100644 index 0000000000000..7670b259b4aec --- /dev/null +++ b/aten/src/ATen/native/mkldnn/ConvPrepack.cpp @@ -0,0 +1,289 @@ +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +#if AT_MKLDNN_ENABLED() + +namespace at { +namespace native { +namespace mkldnn { +namespace internal { +namespace convolution { + +c10::intrusive_ptr createConvPrePackOpContext( + Tensor weight, + c10::optional bias, + std::vector stride, + std::vector padding, + std::vector dilation, + int64_t groups, + std::vector input_size, + std::string attr) { + auto it = fusion_attr_map.find(attr); + TORCH_CHECK(it != fusion_attr_map.end(), "Fusion behavior undefined."); + ideep::attr_t op_attr = it->second; + + return mkldnn::MkldnnConvOpContext::create_context( + std::move(weight), + std::move(bias), + std::move(padding), + std::move(stride), + std::move(dilation), + groups, + std::move(input_size), + op_attr); +} + +ContextConv create( + const Tensor& weight, + const c10::optional& bias, + const IntArrayRef padding, + const IntArrayRef stride, + const IntArrayRef dilation, + const int64_t groups, + const IntArrayRef input_size, + const ideep::attr_t& attr) { + auto k = weight.ndimension(); + int64_t dim = k - 2; + const auto padding_expanded = expand_param_if_needed(padding, "padding", dim); + const auto stride_expanded = expand_param_if_needed(stride, "stride", dim); + const 
auto dilation_expanded = + expand_param_if_needed(dilation, "dilation", dim); + const auto input_size_expanded = + expand_param_if_needed(input_size, "input_size", k); + + c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset); + auto w = itensor_view_from_dense(weight); + // TODO: what if input is nhwc but w is nchw + bool is_channels_last = + weight.suggest_memory_format() == at::MemoryFormat::ChannelsLast; + ideep::tensor::desc expected_weight_desc = + ideep::convolution_forward::expected_weights_desc( + w.get_dims(), + w.get_data_type(), + {stride_expanded.begin(), stride_expanded.end()}, + {padding_expanded.begin(), padding_expanded.end()}, + {padding_expanded.begin(), padding_expanded.end()}, + {dilation_expanded.begin(), dilation_expanded.end()}, + groups, + ideep::algorithm::convolution_direct, + ideep::prop_kind::forward, + /*x_dtype*/ w.get_data_type(), + {input_size_expanded.begin(), input_size_expanded.end()}, + attr, + is_channels_last); + + ideep::tensor packed_weight; + packed_weight.init(expected_weight_desc); + packed_weight.feed_from(w); + + return ContextConv{ + std::move(packed_weight), + bias.has_value() ? c10::make_optional(*bias) : c10::nullopt, + {padding_expanded.begin(), padding_expanded.end()}, + {stride_expanded.begin(), stride_expanded.end()}, + {dilation_expanded.begin(), dilation_expanded.end()}, + groups, + std::move(attr)}; +} + +void _mkldnn_convolution_out( + const ideep::tensor& x, + ideep::tensor& y, + const ideep::tensor& w, + const c10::optional& b, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + IntArrayRef output_sizes, + int64_t groups, + const ideep::attr_t& attr = ideep::attr_t()) { + if (b.has_value()) { + ideep::convolution_forward::compute_v2( + x, + w, + b.value(), + {output_sizes.cbegin(), output_sizes.cend()}, + y, + {stride.begin(), stride.end()}, + {dilation.begin(), dilation.end()}, + {padding.begin(), padding.end()}, + {padding.begin(), padding.end()}, + groups, + ideep::scale_t(), + ideep::scale_t(), + ideep::scale_t(), + ideep::zero_point_t(), + ideep::zero_point_t(), + attr); + } else { + ideep::convolution_forward::compute_v2( + x, + w, + {output_sizes.cbegin(), output_sizes.cend()}, + y, + {stride.begin(), stride.end()}, + {dilation.begin(), dilation.end()}, + {padding.begin(), padding.end()}, + {padding.begin(), padding.end()}, + groups, + ideep::scale_t(), + ideep::scale_t(), + ideep::scale_t(), + ideep::zero_point_t(), + ideep::zero_point_t(), + attr); + } +} + +void mkldnn_convolution_out( + const Tensor& input, + ideep::tensor& mkldnn_output, + const ideep::tensor& mkldnn_weight, + const c10::optional& bias_opt, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + IntArrayRef output_sizes, + int64_t groups, + const ideep::attr_t& attr = ideep::attr_t()) { + c10::MaybeOwned bias_maybe_owned = + at::borrow_from_optional_tensor(bias_opt); + const Tensor& bias = *bias_maybe_owned; + + c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset); + const ideep::tensor mkldnn_input = itensor_from_tensor(input); + c10::optional mkldnn_bias{c10::nullopt}; + if (bias.defined()) { + mkldnn_bias = itensor_from_tensor(bias); + } + + _mkldnn_convolution_out( + mkldnn_input, + mkldnn_output, + mkldnn_weight, + mkldnn_bias, + padding, + stride, + dilation, + output_sizes, + groups, + attr); +} + +std::vector get_output_sizes( + ContextConv& context, + const Tensor& input) { + const ideep::tensor& mkldnn_weight = context.weight_packed_; + IntArrayRef padding = 
context.padding_; + IntArrayRef stride = context.stride_; + IntArrayRef dilation = context.dilation_; + + auto kernel_size = mkldnn_weight.get_dims(); + + std::vector input_size = input.sizes().vec(); + return conv_output_size(input_size, kernel_size, padding, stride, dilation); +} + +Tensor run(ContextConv& context, const Tensor& input) { + std::vector output_sizes = get_output_sizes(context, input); + auto output = at::empty( + output_sizes, + input.options().memory_format(input.suggest_memory_format())); + + bool is_channels_last = + input.suggest_memory_format() == at::MemoryFormat::ChannelsLast; + ideep::tensor y; + + c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset); + ideep::tensor mkldnn_output = itensor_from_tensor(output); + + if (is_channels_last) { + mkldnn_convolution_out( + input, + mkldnn_output, + context.weight_packed_, + context.at_bias_, + context.padding_, + context.stride_, + context.dilation_, + output_sizes, + context.groups_, + context.attr_); + } else { + mkldnn_convolution_out( + input, + y, + context.weight_packed_, + context.at_bias_, + context.padding_, + context.stride_, + context.dilation_, + output_sizes, + context.groups_, + context.attr_); + mkldnn_output.feed_from(y); + } + return output; +} + +void run(ContextConv& context, const Tensor& input, void* output) { + std::vector output_sizes = get_output_sizes(context, input); + + bool is_channels_last = + input.suggest_memory_format() == at::MemoryFormat::ChannelsLast; + ideep::tensor y; + + ideep::tag o_tag = is_channels_last ? ideep::tag::nhwc : ideep::tag::nchw; + ideep::tensor::desc o_desc = { + output_sizes, get_mkldnn_dtype(input.scalar_type()), o_tag}; + ideep::tensor mkldnn_output = {o_desc, output}; + + if (is_channels_last) { + mkldnn_convolution_out( + input, + mkldnn_output, + context.weight_packed_, + context.at_bias_, + context.padding_, + context.stride_, + context.dilation_, + output_sizes, + context.groups_, + context.attr_); + } else { + mkldnn_convolution_out( + input, + y, + context.weight_packed_, + context.at_bias_, + context.padding_, + context.stride_, + context.dilation_, + output_sizes, + context.groups_, + context.attr_); + mkldnn_output.feed_from(y); + } +} + +Tensor conv_run( + const Tensor& input, + const c10::intrusive_ptr& op_context) { + return op_context->run(input); +} + +} // namespace convolution +} // namespace internal +} // namespace mkldnn +} // namespace native +} // namespace at + +#endif // AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/ConvPrepack.h b/aten/src/ATen/native/mkldnn/ConvPrepack.h new file mode 100644 index 0000000000000..03189c5f5e706 --- /dev/null +++ b/aten/src/ATen/native/mkldnn/ConvPrepack.h @@ -0,0 +1,49 @@ +#pragma once + +#include +#include +#include + +#if AT_MKLDNN_ENABLED() + +namespace at { +namespace native { +namespace mkldnn { +namespace internal { +namespace convolution { + +c10::intrusive_ptr createConvPrePackOpContext( + Tensor weight, + c10::optional bias, + std::vector stride, + std::vector padding, + std::vector dilation, + int64_t groups, + std::vector input_size, + std::string attr); + +Tensor conv_run( + const Tensor& input, + const c10::intrusive_ptr& op_context); + +ContextConv create( + const Tensor& weight, + const c10::optional& bias, + const IntArrayRef padding, + const IntArrayRef stride, + const IntArrayRef dilation, + const int64_t groups, + const IntArrayRef input_size, + const ideep::attr_t& attr); + +Tensor run(ContextConv& context, const Tensor& input); + +void run(ContextConv& 
context, const Tensor& input, void* output); + +} // namespace convolution +} // namespace internal +} // namespace mkldnn +} // namespace native +} // namespace at + +#endif // AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/Matmul.cpp b/aten/src/ATen/native/mkldnn/Matmul.cpp index e399e2143dea6..9b07dbfcee5fb 100644 --- a/aten/src/ATen/native/mkldnn/Matmul.cpp +++ b/aten/src/ATen/native/mkldnn/Matmul.cpp @@ -71,7 +71,9 @@ bool mkldnn_bf16_gemm( op_attr = ideep::attr_t::fuse_sum(); } - ideep::tensor::dims a_strides{{1, lda}}, b_strides{{1, ldb}}, c_strides{{1, ldc}}; + // NOTE: View as c-contiguous to avoid extra reordering in mkldnn + // Use identity: C = AB <=> C^T = B^T A^T + ideep::tensor::dims a_strides{{lda, 1}}, b_strides{{ldb, 1}}, c_strides{{ldc, 1}}; if (transa != TransposeType::NoTranspose) { std::swap(a_strides[0], a_strides[1]); } @@ -80,23 +82,23 @@ bool mkldnn_bf16_gemm( } ideep::tensor a({ - /*sizes=*/{m, k}, + /*sizes=*/{k, m}, ideep::tensor::data_type::bf16, /*strides=*/a_strides}, const_cast(a_data)); ideep::tensor b({ - /*sizes=*/{k, n}, + /*sizes=*/{n, k}, ideep::tensor::data_type::bf16, /*strides=*/b_strides}, const_cast(b_data)); ideep::tensor c({ - /*sizes=*/{m, n}, + /*sizes=*/{n, m}, ideep::tensor::data_type::bf16, /*strides=*/c_strides}, c_data); ideep::matmul_forward::compute( - a, b, c, alpha, beta, + b, a, c, alpha, beta, ideep::scale_t(), ideep::scale_t(), ideep::scale_t(), op_attr); if (c.get_data_handle() != c_data){ @@ -104,7 +106,7 @@ bool mkldnn_bf16_gemm( // if given output format is not expected, ideep will re-init an output buffer // under this case, we need copy the re-inited buffer back to given buffer ideep::tensor real_output({ - /*sizes=*/{m, n}, + /*sizes=*/{n, m}, ideep::tensor::data_type::bf16, /*strides=*/c_strides}, c_data); diff --git a/aten/src/ATen/native/mkldnn/OpContext.cpp b/aten/src/ATen/native/mkldnn/OpContext.cpp new file mode 100644 index 0000000000000..2716b4908eb30 --- /dev/null +++ b/aten/src/ATen/native/mkldnn/OpContext.cpp @@ -0,0 +1,47 @@ +#include +#include + +#if AT_MKLDNN_ENABLED() + +namespace at { +namespace native { +namespace mkldnn { + +c10::intrusive_ptr MkldnnConvOpContext::create_context( + at::Tensor&& weight, + c10::optional&& bias, + std::vector&& padding, + std::vector&& stride, + std::vector&& dilation, + int64_t groups, + std::vector&& input_size, + const ideep::attr_t& attr) { + auto op_context = mkldnn::internal::convolution::create( + weight, bias, padding, stride, dilation, groups, input_size, attr); + + auto conv_op_context = c10::make_intrusive( + std::move(weight), + std::move(bias), + std::move(padding), + std::move(stride), + std::move(dilation), + groups, + std::move(input_size), + std::move(op_context)); + + return conv_op_context; +} + +Tensor MkldnnConvOpContext::run(const Tensor& input) { + return mkldnn::internal::convolution::run(op_context_, input); +} + +void MkldnnConvOpContext::run(const Tensor& input, void* output) { + return mkldnn::internal::convolution::run(op_context_, input, output); +} + +} // namespace mkldnn +} // namespace native +} // namespace at + +#endif // AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/OpContext.h b/aten/src/ATen/native/mkldnn/OpContext.h new file mode 100644 index 0000000000000..21e8cc78a5134 --- /dev/null +++ b/aten/src/ATen/native/mkldnn/OpContext.h @@ -0,0 +1,99 @@ +#pragma once + +#include +#include +#include + +#if AT_MKLDNN_ENABLED() + +namespace at { +namespace native { +namespace mkldnn { + +const static std::map 
fusion_attr_map = { + {"none", ideep::attr_t()}, + {"relu", ideep::attr_t::fuse_relu()}, +}; + +using SerializationTypeConvPrePack = std::tuple< + Tensor, + c10::optional, + std::vector, + std::vector, + std::vector, + int64_t, + std::vector, + std::string>; + +class ConvOpContext : public torch::jit::CustomClassHolder { + protected: + Tensor orig_weight_; + c10::optional orig_bias_; + std::vector stride_; + std::vector padding_; + std::vector dilation_; + int64_t groups_; + std::vector input_size_; + std::string attr_; + + public: + SerializationTypeConvPrePack unpack() { + return std::make_tuple( + orig_weight_, + orig_bias_, + stride_, + padding_, + dilation_, + groups_, + input_size_, + attr_); + } + + virtual Tensor run(const Tensor& input) = 0; + virtual void run(const Tensor& input, void* output) = 0; +}; + +class MkldnnConvOpContext final : public ConvOpContext { + private: + ContextConv op_context_; + + public: + MkldnnConvOpContext( + Tensor&& weight, + c10::optional&& bias, + std::vector&& padding, + std::vector&& stride, + std::vector&& dilation, + uint64_t groups, + std::vector&& input_size, + ContextConv&& op_context) + : op_context_(std::move(op_context)) { + orig_weight_ = std::move(weight); + orig_bias_ = std::move(bias); + padding_ = std::move(padding); + stride_ = std::move(stride); + dilation_ = std::move(dilation); + groups_ = groups; + input_size_ = std::move(input_size); + } + + Tensor run(const Tensor& input) override; + + void run(const Tensor& input, void* output) override; + + static c10::intrusive_ptr create_context( + Tensor&& weight, + c10::optional&& bias, + std::vector&& padding, + std::vector&& stride, + std::vector&& dilation, + int64_t groups, + std::vector&& input_size, + const ideep::attr_t& attr); +}; + +} // namespace mkldnn +} // namespace native +} // namespace at + +#endif // AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/Pooling.cpp b/aten/src/ATen/native/mkldnn/Pooling.cpp index 5800bd2247b6e..80cfa2efcc107 100644 --- a/aten/src/ATen/native/mkldnn/Pooling.cpp +++ b/aten/src/ATen/native/mkldnn/Pooling.cpp @@ -2,6 +2,7 @@ #include #include #include +#include #include #include #include @@ -80,6 +81,12 @@ Tensor mkldnn_adaptive_avg_pool2d(Tensor const& input, IntArrayRef output_size) TORCH_CHECK(false, "mkldnn_adaptive_avg_pool2d: ATen not compiled with MKLDNN support"); } +Tensor& mkldnn_adaptive_avg_pool2d_out_stub(const Tensor& input, + IntArrayRef output_size, + Tensor& output) { + TORCH_CHECK(false, "mkldnn_adaptive_avg_pool2d_out_stub: ATen not compiled with MKLDNN support"); +} + Tensor& mkldnn_adaptive_avg_pool2d_out(const Tensor& input, IntArrayRef output_size, Tensor& output) { @@ -498,10 +505,19 @@ Tensor mkldnn_adaptive_avg_pool2d( /*algo*/ ideep::algorithm::pooling_avg); } +Tensor& mkldnn_adaptive_avg_pool2d_out_stub(const Tensor& input, + IntArrayRef output_size, + Tensor& output) { + TORCH_CHECK(false, "mkldnn_adaptive_avg_pool2d_out_stub: in-place mkldnn operations are not supported yet"); +} + Tensor& mkldnn_adaptive_avg_pool2d_out(const Tensor& input, IntArrayRef output_size, Tensor& output) { - TORCH_CHECK(false, "mkldnn_adaptive_avg_pool2d_out: in-place mkldnn operations are not supported yet"); + auto tmp_output = at::native::mkldnn_adaptive_avg_pool2d(input, output_size); + at::native::resize_output(output, tmp_output.sizes()); + output.copy_(tmp_output); + return output; } Tensor mkldnn_max_pool2d_backward( diff --git a/aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp 
b/aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp new file mode 100644 index 0000000000000..44447441f6daa --- /dev/null +++ b/aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp @@ -0,0 +1,60 @@ +#include +#include +#include +#include +#include + +#if AT_MKLDNN_ENABLED() + +namespace at { +namespace native { +namespace mkldnn { + +using namespace internal::convolution; + +TORCH_LIBRARY(mkldnn, m) { + m.class_(TORCH_SELECTIVE_CLASS("ConvOpContext")) + .def_pickle( + [](const c10::intrusive_ptr& op_context) + -> SerializationTypeConvPrePack { // __getstate__ + return op_context->unpack(); + }, + [](SerializationTypeConvPrePack state) + -> c10::intrusive_ptr { // __setstate__ + return createConvPrePackOpContext( + std::move(std::get<0>(state)), + std::move(std::get<1>(state)), + std::move(std::get<2>(state)), + std::move(std::get<3>(state)), + std::move(std::get<4>(state)), + // NOLINTNEXTLINE(performance-move-const-arg,cppcoreguidelines-avoid-magic-numbers) + std::move(std::get<5>(state)), + // NOLINTNEXTLINE(performance-move-const-arg,cppcoreguidelines-avoid-magic-numbers) + std::move(std::get<6>(state)), + // NOLINTNEXTLINE(performance-move-const-arg,cppcoreguidelines-avoid-magic-numbers) + std::move(std::get<7>(state))); + }); +} + +TORCH_LIBRARY(mkldnn_prepacked, m) { + m.def(TORCH_SELECTIVE_SCHEMA( + "mkldnn_prepacked::conv2d_prepack(Tensor W, Tensor? B, int[2] stride, int[2] padding, int[2] dilation, int groups, int[4] input_size, str attr) -> __torch__.torch.classes.mkldnn.ConvOpContext")); + + m.def(TORCH_SELECTIVE_SCHEMA( + "mkldnn_prepacked::conv2d_run(Tensor X, __torch__.torch.classes.mkldnn.ConvOpContext W_prepack) -> Tensor Y")); +} + +TORCH_LIBRARY_IMPL(mkldnn_prepacked, CPU, m) { + m.impl( + TORCH_SELECTIVE_NAME("mkldnn_prepacked::conv2d_prepack"), + TORCH_FN(createConvPrePackOpContext)); + + m.impl( + TORCH_SELECTIVE_NAME("mkldnn_prepacked::conv2d_run"), TORCH_FN(conv_run)); +} + +} // namespace mkldnn +} // namespace native +} // namespace at + +#endif // AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mps/OperationUtils.mm b/aten/src/ATen/native/mps/OperationUtils.mm index 7e22e2c103a8f..97012dd3b0c9e 100644 --- a/aten/src/ATen/native/mps/OperationUtils.mm +++ b/aten/src/ATen/native/mps/OperationUtils.mm @@ -231,7 +231,12 @@ void printTensorNDArray(const Tensor& t) { MPSGraphTensorData* tdata = [[[MPSGraphTensorData alloc] initWithMTLBuffer:selfBuf shape:selfShape dataType:selfDType] autorelease]; + C10_CLANG_DIAGNOSTIC_PUSH() + #if C10_CLANG_HAS_WARNING("-Wobjc-method-access") + C10_CLANG_DIAGNOSTIC_IGNORE("-Wobjc-method-access") + #endif [tdata printNDArray]; + C10_CLANG_DIAGNOSTIC_POP() } Placeholder::Placeholder(MPSGraphTensor* mpsGraphTensor, const Tensor& src, MPSShape *mpsShape) : _tensor(src) @@ -247,7 +252,7 @@ void printTensorNDArray(const Tensor& t) { if (!_tensor.has_storage()) { // if we cannot gather, we make the the tensor contiguous implicitly, and keep // it in placeholder to be able to retrieve it when we return from constructor - _tensor = src.contiguous(); + _tensor = src.clone(); } srcBuf = getMTLBufferStorage(_tensor); } diff --git a/aten/src/ATen/native/mps/operations/Activation.mm b/aten/src/ATen/native/mps/operations/Activation.mm index b741276b45e01..e929a41be2ce1 100644 --- a/aten/src/ATen/native/mps/operations/Activation.mm +++ b/aten/src/ATen/native/mps/operations/Activation.mm @@ -417,6 +417,10 @@ Tensor relu_mps(const Tensor& self) { using CachedGraph = MPSUnaryCachedGraph; TORCH_CHECK(output.is_mps()); + 
if(output.numel() == 0) { + return; + } + MPSGraphCache* cache_ = MPSGraphCache::getInstance(); MPSStream* stream = getCurrentMPSStream(); @@ -1452,10 +1456,13 @@ Tensor glu_backward_mps (const Tensor& grad_output, if(result.numel() == 0) return; + auto beta_f = beta.to(); + struct CachedGraph : public MPSCachedGraph { CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) {} MPSGraphTensor *inputTensor_ = nil; + MPSGraphTensor *betaTensor_ = nil; MPSGraphTensor *outputTensor_ = nil; }; @@ -1475,18 +1482,16 @@ Tensor glu_backward_mps (const Tensor& grad_output, @autoreleasepool { MPSGraph* mpsGraph = make_mps_graph(); newCachedGraph = new CachedGraph(mpsGraph); - MPSGraphTensor *inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self); + MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self); - MPSGraphTensor *reluTensor = [mpsGraph reLUWithTensor:inputTensor - name:nil]; + MPSGraphTensor* betaTensor = mpsGraphScalarPlaceHolder(mpsGraph, beta); + MPSGraphTensor* reluTensor = [mpsGraph reLUWithTensor:inputTensor + name:nil]; MPSGraphTensor* unitTensor = [mpsGraph constantWithScalar:1.0 shape:@[@1] dataType:getMPSDataType(self.scalar_type())]; - MPSGraphTensor* betaTensor = [mpsGraph constantWithScalar:beta.to() - shape:@[@1] - dataType:getMPSDataType(self.scalar_type())]; MPSGraphTensor* reciprocalBetaTensor = [mpsGraph reciprocalWithTensor:betaTensor name:nil]; MPSGraphTensor* bxTensor = [mpsGraph multiplicationWithPrimaryTensor:inputTensor @@ -1516,7 +1521,8 @@ Tensor glu_backward_mps (const Tensor& grad_output, name:nil]; newCachedGraph->inputTensor_ = inputTensor; - newCachedGraph->outputTensor_ = outputTensor; + newCachedGraph->betaTensor_ = betaTensor; + newCachedGraph->outputTensor_ = outputTensor; } return newCachedGraph; }); @@ -1527,7 +1533,8 @@ Tensor glu_backward_mps (const Tensor& grad_output, // Create dictionary of inputs and outputs NSDictionary* feeds = @{ - selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData() + selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(), + cachedGraph->betaTensor_ : getMPSGraphTensorFromScalar(stream, beta_f, MPSDataTypeFloat32) }; NSDictionary* results = @{ outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() @@ -1550,11 +1557,14 @@ Tensor glu_backward_mps (const Tensor& grad_output, if(grad_input.numel() == 0) return; + auto beta_f = beta.to(); + struct CachedGraph : public MPSCachedGraph { CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) {} MPSGraphTensor *gradOutputTensor_ = nil; MPSGraphTensor *inputTensor_ = nil; + MPSGraphTensor *betaTensor_ = nil; MPSGraphTensor *outputTensor_ = nil; }; @@ -1574,16 +1584,15 @@ Tensor glu_backward_mps (const Tensor& grad_output, @autoreleasepool { MPSGraph* mpsGraph = make_mps_graph(); newCachedGraph = new CachedGraph(mpsGraph); - MPSGraphTensor *gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output); + MPSGraphTensor* gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output); - MPSGraphTensor *inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self); + MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self); + + MPSGraphTensor* betaTensor = mpsGraphScalarPlaceHolder(mpsGraph, beta); MPSGraphTensor* unitTensor = [mpsGraph constantWithScalar:1.0 shape:@[@1] dataType:getMPSDataType(self.scalar_type())]; - MPSGraphTensor* betaTensor = [mpsGraph constantWithScalar:beta.to() - shape:@[@1] - dataType:getMPSDataType(self.scalar_type())]; MPSGraphTensor* bxTensor = [mpsGraph 
multiplicationWithPrimaryTensor:inputTensor secondaryTensor:betaTensor name:nil]; @@ -1611,6 +1620,7 @@ Tensor glu_backward_mps (const Tensor& grad_output, newCachedGraph->gradOutputTensor_ = gradOutputTensor; newCachedGraph->inputTensor_ = inputTensor; + newCachedGraph->betaTensor_ = betaTensor; newCachedGraph->outputTensor_ = outputTensor; } return newCachedGraph; @@ -1624,7 +1634,8 @@ Tensor glu_backward_mps (const Tensor& grad_output, // Create dictionary of inputs and outputs NSDictionary* feeds = @{ gradOutputPlaceholder.getMPSGraphTensor() : gradOutputPlaceholder.getMPSGraphTensorData(), - selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData() + selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(), + cachedGraph->betaTensor_ : getMPSGraphTensorFromScalar(stream, beta_f, MPSDataTypeFloat32) }; NSDictionary* results = @{ gradInputPlaceholder.getMPSGraphTensor() : gradInputPlaceholder.getMPSGraphTensorData() diff --git a/aten/src/ATen/native/mps/operations/BinaryOps.mm b/aten/src/ATen/native/mps/operations/BinaryOps.mm index 6e325de38c830..b619307ef8aa1 100644 --- a/aten/src/ATen/native/mps/operations/BinaryOps.mm +++ b/aten/src/ATen/native/mps/operations/BinaryOps.mm @@ -403,5 +403,6 @@ void add_sub_template(const Tensor& self, const Tensor& other, const Scalar& alp runMPSGraph(stream, cachedGraph->graph(), feeds, results); } } + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/mps/operations/BitwiseOps.mm b/aten/src/ATen/native/mps/operations/BitwiseOps.mm new file mode 100644 index 0000000000000..c16818d7d542c --- /dev/null +++ b/aten/src/ATen/native/mps/operations/BitwiseOps.mm @@ -0,0 +1,336 @@ +#include +#include +#include +#include + +namespace { +static const char* BITWISE_OPS_TEMPLATE = R"METAL( + +kernel void bitwise_and_tensor(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + device {2} *b [[buffer(3)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + out[offset] = a[offset] & b [offset]; +}} + +kernel void bitwise_and_scalar(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + constant {2} &b [[buffer(3)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + out[offset] = a[offset] & b; +}} + + +kernel void bitwise_or_tensor(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + device {2} *b [[buffer(3)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + out[offset] = a[offset] | b [offset]; +}} + +kernel void bitwise_or_scalar(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + constant {2} &b [[buffer(3)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + out[offset] = a[offset] | b; +}} + +kernel void bitwise_xor_tensor(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + device {2} *b [[buffer(3)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + out[offset] = a[offset] ^ b [offset]; +}} + +kernel void bitwise_xor_scalar(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + constant {2} &b [[buffer(3)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + 
out[offset] = a[offset] ^ b; +}} + +kernel void bitwise_not(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + out[offset] = ~a[offset]; +}} +)METAL"; + + +const std::string& getMetalType(const c10::ScalarType& t) { + // Mapping from c10::ScalarType to integral type that can be used for bitwise ops + // As bitwise ops sign-agnostic map signed/unsigned char and boolean to the same type + static std::unordered_map scalar_to_metal_type = { + {c10::ScalarType::Long, "long"}, + {c10::ScalarType::Int, "int"}, + {c10::ScalarType::Short, "short"}, + {c10::ScalarType::Byte, "char"}, + {c10::ScalarType::Char, "char"}, + {c10::ScalarType::Bool, "char"}, + }; + + auto it = scalar_to_metal_type.find(t); + TORCH_CHECK(it != scalar_to_metal_type.end(), "Unsupported type ", t); + return it->second; +} + +const std::string& getMetalType(const at::Tensor& t) { + return getMetalType(t.scalar_type()); +} + +const std::string& getMetalType(const c10::Scalar& s) { + return getMetalType(s.type()); +} + + +static id compileBitwiseOpsLibrary(id device, + const std::string& t1, + const std::string& t2, + const std::string& t3) { + auto key = t1 + t2 + t3; + static std::unordered_map> libMap; + auto it = libMap.find(key); + if (it != libMap.end()) { + return it->second; + } + NSError *error = nil; + auto rc = [device newLibraryWithSource:[NSString stringWithUTF8String:fmt::format(BITWISE_OPS_TEMPLATE, t1, t2, t3).c_str()] + options:nil + error:&error]; + TORCH_CHECK(rc != nil && error == nil, "Failed to compile library: ", [[error localizedDescription] UTF8String]); + libMap[key] = rc; + return rc; +} + + +static id getCPLState(id device, + const std::string& t1, + const std::string& t2, + const std::string& t3, + const std::string& fname) { + auto key = t1 + t2 + t3 + fname; + static std::unordered_map> cplMap; + auto it = cplMap.find(key); + if (it != cplMap.end()) { + return it->second; + } + NSError *error = nil; + auto library = compileBitwiseOpsLibrary(device, t1, t2, t3); + id func = [library newFunctionWithName:[NSString stringWithUTF8String:fname.c_str()]]; + TORCH_CHECK(func != nil, "Can't get function ", fname); + auto rc = [device newComputePipelineStateWithFunction:func error:&error]; + TORCH_CHECK(rc != nil && error == nil, "Failed to construct pipeline state: ", [[error localizedDescription] UTF8String]); + cplMap[key] = rc; + return rc; +} + +void dispatch1DJob(id commandEncoder, id cplState, uint32_t length) +{ + uint32_t maxThreadsPerGroup = [cplState maxTotalThreadsPerThreadgroup]; + auto size = MTLSizeMake(length, 1, 1); + auto threadGroupSize = MTLSizeMake(std::min(maxThreadsPerGroup, length), 1, 1); + [commandEncoder dispatchThreads:size + threadsPerThreadgroup:threadGroupSize]; +} + +void handle_tensor_tensor_binary_op(const at::Tensor& self, const at::Tensor& other, at::Tensor& output, const std::string& kernel_name) { + using namespace at::mps; + MPSStream* stream = getCurrentMPSStream(); + id cplState = getCPLState(MPSDevice::getInstance()->device(), + getMetalType(output), + getMetalType(self), + getMetalType(other), + kernel_name); + uint32_t length = output.numel(); + dispatch_sync(stream->queue(), ^(){ + id buffer = stream->commandBuffer(); + id commandEncoder = [buffer computeCommandEncoder]; + + id outBuf = __builtin_bit_cast(id, output.storage().data()); + id selfBuf = __builtin_bit_cast(id, self.storage().data()); + id otherBuf = 
__builtin_bit_cast(id, other.storage().data()); + + [commandEncoder pushDebugGroup:[NSString stringWithFormat:@"Dispatch %s kernel", kernel_name.c_str()]]; + [commandEncoder setComputePipelineState:cplState]; + [commandEncoder setBytes:&length length:sizeof(length) atIndex:0]; + [commandEncoder setBuffer:outBuf offset:output.storage_offset()*output.itemsize() atIndex:1]; + [commandEncoder setBuffer:selfBuf offset:self.storage_offset()*self.itemsize() atIndex:2]; + [commandEncoder setBuffer:otherBuf offset:other.storage_offset()*other.itemsize() atIndex:3]; + dispatch1DJob(commandEncoder, cplState, length); + [commandEncoder endEncoding]; + stream->commit(true); + }); +} + +void handle_tensor_scalar_binary_op(const at::Tensor& self, const at::Scalar& other, at::Tensor& output, const std::string& kernel_name) { + using namespace at::mps; + MPSStream* stream = getCurrentMPSStream(); + id cplState = getCPLState(MPSDevice::getInstance()->device(), + getMetalType(output), + getMetalType(self), + getMetalType(other), + kernel_name); + uint64_t sval = other.to(); + uint32_t length = output.numel(); + dispatch_sync(stream->queue(), ^(){ + id buffer = stream->commandBuffer(); + id commandEncoder = [buffer computeCommandEncoder]; + + id outBuf = __builtin_bit_cast(id, output.storage().data()); + id selfBuf = __builtin_bit_cast(id, self.storage().data()); + + [commandEncoder pushDebugGroup:[NSString stringWithFormat:@"Dispatch %s kernel", kernel_name.c_str()]]; + [commandEncoder setComputePipelineState:cplState]; + [commandEncoder setBytes:&length length:sizeof(length) atIndex:0]; + [commandEncoder setBuffer:outBuf offset:output.storage_offset()*output.itemsize() atIndex:1]; + [commandEncoder setBuffer:selfBuf offset:self.storage_offset()*self.itemsize() atIndex:2]; + [commandEncoder setBytes:&sval length:sizeof(sval) atIndex:3]; + dispatch1DJob(commandEncoder, cplState, length); + [commandEncoder endEncoding]; + stream->commit(true); + }); +} + +at::Tensor& _bitwise_op_out_mps (const at::Tensor& self, const at::Tensor& other, at::Tensor& output_, const std::string& op_name) { + using namespace at::mps; + const bool is_self_scalar = self.dim() == 0; + const bool is_other_scalar = other.dim() == 0; + + at::Tensor output = output_; + bool needs_output_copy = false; + + auto output_size = at::infer_size_dimvector(self.sizes(), other.sizes()); + at::native::resize_output(output, output_size); + if (!output.is_contiguous()) { + output = output.contiguous(); + needs_output_copy = true; + } + if (is_other_scalar && is_self_scalar) { + if (op_name == "and") { + output.fill_(c10::Scalar(self.item() & other.item())); + } else if (op_name == "or") { + output.fill_(c10::Scalar(self.item() | other.item())); + } else if (op_name == "xor") { + output.fill_(c10::Scalar(self.item() ^ other.item())); + } else { + TORCH_CHECK(false, "Unknown operation to be performed over scalars ", op_name); + } + } else if (is_other_scalar) { + handle_tensor_scalar_binary_op(self.contiguous(), other.item(), output, fmt::format("bitwise_{}_scalar", op_name)); + } else if (is_self_scalar) { + handle_tensor_scalar_binary_op(other.contiguous(), self.item(), output, fmt::format("bitwise_{}_scalar", op_name)); + } else { + handle_tensor_tensor_binary_op(self.expand(output_size).contiguous(), + other.expand(output_size).contiguous(), + output, + fmt::format("bitwise_{}_tensor", op_name)); + } + if (needs_output_copy) { + output_.copy_(output); + } + return output_; +} + +at::Tensor& bitwise_and_out_mps (const at::Tensor& self, const 
at::Tensor& other, at::Tensor& output) { + return _bitwise_op_out_mps(self, other, output, "and"); +} + +at::Tensor& bitwise_or_out_mps (const at::Tensor& self, const at::Tensor& other, at::Tensor& output) { + return _bitwise_op_out_mps(self, other, output, "or"); +} + +at::Tensor& bitwise_xor_out_mps (const at::Tensor& self, const at::Tensor& other, at::Tensor& output) { + return _bitwise_op_out_mps(self, other, output, "xor"); +} + +at::Tensor& bitwise_not_out_mps (const at::Tensor& self, at::Tensor& output_) { + // Handle boolean tensor using logical not + if (self.scalar_type() == c10::ScalarType::Bool) { + return at::native::logical_not_out_mps(self, output_); + } + + at::Tensor output = output_; + bool needs_output_copy = false; + + at::native::resize_output(output, self.sizes()); + if (!output.is_contiguous()) { + output = output.contiguous(); + needs_output_copy = true; + } + if (self.dim() == 0) { + if (self.scalar_type() == c10::ScalarType::Byte) { + // Unsigned types need a special handling to keep result of operation in 0..255 output + output.fill_(c10::Scalar(static_cast(~self.item()))); + } else { + output.fill_(c10::Scalar(~self.item())); + } + return output_; + } + using namespace at::mps; + MPSStream* stream = getCurrentMPSStream(); + id cplState = getCPLState(MPSDevice::getInstance()->device(), + getMetalType(output), + getMetalType(self), + getMetalType(self), + "bitwise_not"); + uint32_t length = output.numel(); + dispatch_sync(stream->queue(), ^(){ + id buffer = stream->commandBuffer(); + id commandEncoder = [buffer computeCommandEncoder]; + + id outBuf = __builtin_bit_cast(id, output.storage().data()); + id selfBuf = __builtin_bit_cast(id, self.storage().data()); + + [commandEncoder pushDebugGroup:@"Dispatch bitwise_not kernel"]; + [commandEncoder setComputePipelineState:cplState]; + [commandEncoder setBytes:&length length:sizeof(length) atIndex:0]; + [commandEncoder setBuffer:outBuf offset:output.storage_offset()*output.itemsize() atIndex:1]; + [commandEncoder setBuffer:selfBuf offset:self.storage_offset()*self.itemsize() atIndex:2]; + dispatch1DJob(commandEncoder, cplState, length); + [commandEncoder endEncoding]; + stream->commit(true); + }); + if (needs_output_copy) { + output_.copy_(output); + } + return output_; +} + + + +TORCH_LIBRARY_IMPL(aten, MPS, m) { + m.impl("bitwise_and.Tensor_out", bitwise_and_out_mps); + m.impl("bitwise_or.Tensor_out", bitwise_or_out_mps); + m.impl("bitwise_xor.Tensor_out", bitwise_xor_out_mps); + m.impl("bitwise_not.out", bitwise_not_out_mps); +} + +} // anonymous namespace diff --git a/aten/src/ATen/native/mps/operations/ConstantOps.mm b/aten/src/ATen/native/mps/operations/ConstantOps.mm index 0cfd7ccc2ff5b..a5ddd82a229eb 100644 --- a/aten/src/ATen/native/mps/operations/ConstantOps.mm +++ b/aten/src/ATen/native/mps/operations/ConstantOps.mm @@ -35,11 +35,15 @@ MPSGraph *mpsGraph = make_mps_graph(); newCachedGraph = new CachedGraph(mpsGraph); auto isBool = self.scalar_type() == c10::ScalarType::Bool; - auto dataType = (!isBool) ? getMPSScalarType(self.scalar_type()) : MPSDataTypeInt8; + auto isUInt8 = self.scalar_type() == c10::ScalarType::Byte; + auto dataType = !isUInt8 ? !isBool ? 
getMPSScalarType(self.scalar_type()) : MPSDataTypeInt8 : MPSDataTypeUInt32; // constantWithScalar does not work for boolTypes on MacOS-12.[34] // workaround by filing it as int8 tensor and than casting to bool // See https://github.com/pytorch/pytorch/issues/82427 - MPSGraphTensor* inputTensor = [mpsGraph constantWithScalar:value.toDouble() + // constantWithScalar does not work for UInt8 Types on MacOS-12.[34]/Ventura preview + // workaround by filing it as uint32 tensor and than casting to uint8 + // See https://github.com/pytorch/pytorch/issues/83692 + MPSGraphTensor* inputTensor = [mpsGraph constantWithScalar: value.toDouble() shape:getMPSShape(self) dataType:dataType]; MPSGraphTensor* outputTensor = [mpsGraph identityWithTensor:inputTensor @@ -49,6 +53,11 @@ toType:MPSDataTypeBool name:@"constWithBool-workaround"]; } + if (isUInt8) { + outputTensor = [mpsGraph castTensor:outputTensor + toType:MPSDataTypeUInt8 + name:@"constWithUInt8-workaround"]; + } newCachedGraph->outputTensor_ = outputTensor; } diff --git a/aten/src/ATen/native/mps/operations/Convolution.mm b/aten/src/ATen/native/mps/operations/Convolution.mm index 0fe690698c3b8..41d68d4d24c2e 100644 --- a/aten/src/ATen/native/mps/operations/Convolution.mm +++ b/aten/src/ATen/native/mps/operations/Convolution.mm @@ -33,8 +33,9 @@ void fill_conv_desc(MPSGraphConvolution2DOpDescriptor* descriptor_, descriptor_.dataLayout = (memory_format == at::MemoryFormat::Contiguous) ? MPSGraphTensorNamedDataLayoutNCHW : MPSGraphTensorNamedDataLayoutNHWC; - descriptor_.weightsLayout = (memory_format == at::MemoryFormat::Contiguous) ? - MPSGraphTensorNamedDataLayoutOIHW : MPSGraphTensorNamedDataLayoutHWIO; + + // PyTorch always uses OIHW memory layout for weights + descriptor_.weightsLayout = MPSGraphTensorNamedDataLayoutOIHW; descriptor_.groups = groups; } @@ -46,6 +47,8 @@ Tensor _mps_convolution( IntArrayRef stride, IntArrayRef dilation, int64_t groups) { + TORCH_CHECK(input_t.dim() < 5, "Conv3D is not supported on MPS"); + namespace native_mps = at::native::mps; CheckedFrom c = "mps_convolution"; TensorArg input { input_t, "input", 1 }, @@ -61,6 +64,7 @@ Tensor _mps_convolution( bias_defined = bias_opt->defined(); auto memory_format = input_t.suggest_memory_format(); + bool is_channels_last = (memory_format == at::MemoryFormat::ChannelsLast); auto output_t = at::empty( conv_output_size(input->sizes(), weight->sizes(), padding, stride, dilation), @@ -68,7 +72,7 @@ Tensor _mps_convolution( c10::nullopt, kMPS, c10::nullopt, - memory_format); + c10::nullopt); if (output_t.numel() == 0) { return output_t; @@ -122,6 +126,18 @@ Tensor _mps_convolution( + mps::getTensorsStringKey({input_t, weight_t}) + ":" + to_string(bias_defined) + ":" + bias_shape_key; CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + MPSShape* inputShape = nil; + + if (is_channels_last) { + const auto inputSizes = input_t.sizes(); + const NSUInteger N = inputSizes[0]; + const NSUInteger C = inputSizes[1]; + const NSUInteger H = inputSizes[2]; + const NSUInteger W = inputSizes[3]; + inputShape = @[@(N), @(H), @(W), @(C)]; + } else { + inputShape = native_mps::getMPSShape(input_t); + } if(!cachedGraph) { native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { @@ -133,26 +149,34 @@ Tensor _mps_convolution( newCachedGraph = new CachedGraph(mpsGraph); MPSGraphConvolution2DOpDescriptor *descriptor_ = [[MPSGraphConvolution2DOpDescriptor new] autorelease]; - fill_conv_desc(descriptor_, stride[0], stride[1], - dilation[0], 
dilation[1], + fill_conv_desc(descriptor_, stride[1], stride[0], + dilation[1], dilation[0], padding[1], padding[0], memory_format, groups); - MPSGraphTensor* inputTensor = native_mps::mpsGraphRankedPlaceHolder(mpsGraph, input_t); + MPSGraphTensor* inputTensor = native_mps::mpsGraphRankedPlaceHolder(mpsGraph, native_mps::getMPSScalarType(input_t.scalar_type()), inputShape); MPSGraphTensor* weightTensor = native_mps::mpsGraphRankedPlaceHolder(mpsGraph, weight_t); + MPSGraphTensor* biasTensor = nil; if(bias_defined) biasTensor = native_mps::mpsGraphUnrankedPlaceHolder(mpsGraph, native_mps::getMPSDataType((bias_opt.value()).scalar_type())); - MPSGraphTensor* outputTensor = [mpsGraph convolution2DWithSourceTensor:inputTensor - weightsTensor:weightTensor - descriptor:descriptor_ - name:nil]; + MPSGraphTensor* outputTensor = [mpsGraph convolution2DWithSourceTensor: inputTensor + weightsTensor: weightTensor + descriptor: descriptor_ + name: nil]; + if (is_channels_last) { + // NHWC -> NCHW + outputTensor = [mpsGraph transposeTensor: [mpsGraph transposeTensor:outputTensor dimension:-1 withDimension:-2 name:nil] + dimension: -2 + withDimension: -3 + name: nil]; + } if(bias_defined) { - outputTensor = [mpsGraph additionWithPrimaryTensor:outputTensor - secondaryTensor:biasTensor - name:nil]; + outputTensor = [mpsGraph additionWithPrimaryTensor: outputTensor + secondaryTensor: biasTensor + name: nil]; } newCachedGraph->inputTensor_ = inputTensor; @@ -165,7 +189,7 @@ Tensor _mps_convolution( cachedGraph = static_cast(tmpCachedGraph); } - auto inputPlaceholder = native_mps::Placeholder(cachedGraph->inputTensor_, input_t); + auto inputPlaceholder = native_mps::Placeholder(cachedGraph->inputTensor_, input_t, inputShape); auto weightsPlaceholder = native_mps::Placeholder(cachedGraph->weightTensor_, weight_t); auto biasPlaceholder = native_mps::Placeholder(); // Reshape the bias to be broadcastable with output of conv2d @@ -207,7 +231,7 @@ Tensor mps_convolution_backward_input( c10::nullopt, kMPS, c10::nullopt, - memory_format); + c10::nullopt); // Avoid "grad_input" when this is being used as transposed convolution TensorArg grad_input{ grad_input_t, "result", 0 }; @@ -242,9 +266,7 @@ Tensor mps_convolution_backward_input( } MPSShape* mps_input_shape = getMPSShape(input_size); - NSString* ns_shape_key = [[mps_input_shape valueForKey:@"description"] componentsJoinedByString:@","]; - string key = "mps_convolution_backward_input:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" + to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" + to_string(padding[0]) + ":" + to_string(padding[1]) + ":" @@ -263,8 +285,8 @@ Tensor mps_convolution_backward_input( newCachedGraph = new CachedGraph(mpsGraph); MPSGraphConvolution2DOpDescriptor *descriptor_ = [[MPSGraphConvolution2DOpDescriptor new] autorelease]; - fill_conv_desc(descriptor_, stride[0], stride[1], - dilation[0], dilation[1], + fill_conv_desc(descriptor_, stride[1], stride[0], + dilation[1], dilation[0], padding[1], padding[0], memory_format, groups); @@ -320,7 +342,7 @@ Tensor mps_convolution_backward_weights( checkAllSameType(c, {grad_output, input}); checkAllSameGPU(c, {grad_output, input}); - auto grad_weight_t = at::empty(weight_size, grad_output_t.options(), memory_format); + auto grad_weight_t = at::empty(weight_size, grad_output_t.options(), c10::nullopt); TensorArg grad_weight{ grad_weight_t, "result", 0 }; convolution_shape_check(c, input, grad_weight, grad_output, padding, stride, dilation, groups); @@ -353,9 +375,7 @@ Tensor 
mps_convolution_backward_weights( } MPSShape* mps_weight_shape = getMPSShape(weight_size); - NSString* ns_shape_key = [[mps_weight_shape valueForKey:@"description"] componentsJoinedByString:@","]; - string key = "mps_convolution_backward_weights:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" + to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" + to_string(padding[0]) + ":" + to_string(padding[1]) + ":" @@ -374,8 +394,8 @@ Tensor mps_convolution_backward_weights( newCachedGraph = new CachedGraph(mpsGraph); MPSGraphConvolution2DOpDescriptor *descriptor_ = [[MPSGraphConvolution2DOpDescriptor new] autorelease]; - fill_conv_desc(descriptor_, stride[0], stride[1], - dilation[0], dilation[1], + fill_conv_desc(descriptor_, stride[1], stride[0], + dilation[1], dilation[0], padding[1], padding[0], memory_format, groups); diff --git a/aten/src/ATen/native/mps/operations/Distributions.mm b/aten/src/ATen/native/mps/operations/Distributions.mm index a4b73bd75fb03..999b1cc79d5b2 100644 --- a/aten/src/ATen/native/mps/operations/Distributions.mm +++ b/aten/src/ATen/native/mps/operations/Distributions.mm @@ -1,16 +1,8 @@ // Copyright © 2022 Apple Inc. -#include -#include -#include -#include -#include #include #include -#include -#include #include -#include namespace at { namespace native { @@ -24,7 +16,6 @@ } double delta = (to - from); AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "check_uniform_bounds", [&] { - const auto dtype = input.dtype(); const auto min = static_cast(std::numeric_limits::lowest()); const auto max = static_cast(std::numeric_limits::max()); TORCH_CHECK(from <= to, "uniform_ expects to return a [from, to) range, but found from=", from, " > to=", to); @@ -198,11 +189,6 @@ Tensor normal_mps(const Tensor& mean, const Tensor& std, c10::optional gen, Tensor& output) { - TORCH_CHECK( - std.min().ge(0).item(), - "normal expects all elements of std >= 0.0"); - - Tensor mean_t = empty_mps( output.sizes(), output.scalar_type(), @@ -218,7 +204,6 @@ Tensor normal_mps(const Tensor& mean, const Tensor& std, c10::optional gen, Tensor& output) { TORCH_CHECK(!std.is_complex(), "normal expects standard deviation to be non-complex"); - TORCH_CHECK(std.numel() == 0 || std.min().ge(0).item(), "normal expects all elements of std >= 0.0"); // Check that mean and std have same number of elements TORCH_CHECK(mean.numel() == std.numel(), "normal_mps_out: mean and std must have same number of elements") @@ -528,8 +513,6 @@ static void check_from_to_in_range(int64_t from, int64_t to_inc, ScalarType scal MPSGraphTensor *outputTensor_ = nil; }; - MPSGraphCache* cache_ = MPSGraphCache::getInstance(); - MPSStream* stream = getCurrentMPSStream(); uint64_t seed_ = c10::detail::getNonDeterministicRandom(true); diff --git a/aten/src/ATen/native/mps/operations/Indexing.h b/aten/src/ATen/native/mps/operations/Indexing.h new file mode 100644 index 0000000000000..4227a0cf62c28 --- /dev/null +++ b/aten/src/ATen/native/mps/operations/Indexing.h @@ -0,0 +1,51 @@ +// Copyright © 2022 Apple Inc. 
+ +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace at::mps; + +namespace at { +namespace native { +namespace mps { + +std::string getMetalScalarType(ScalarType scalar_type) { + std::string res = ""; + switch (scalar_type) { + case ScalarType::Float: + res = "float"; break; + case ScalarType::Half: + res = "half"; break; + case ScalarType::Long: + res = "long"; break; + case ScalarType::Int: + res = "int"; break; + case ScalarType::Short: + res = "short"; break; + case ScalarType::Char: + res = "char"; break; + case ScalarType::Byte: + res = "uchar"; break; + case ScalarType::Bool: + res = "bool"; break; + default: + break; + } + return res; +} + +std::string getIndexFunctionName(ScalarType scalar_type, bool index_select, bool accumulate) { + std::string indexFunction = index_select ? "index_select_" : + (accumulate && (scalar_type != kBool)) ? "index_put_accumulate_" : "index_put_"; + + return indexFunction + getMetalScalarType(scalar_type); +} +} +} +} diff --git a/aten/src/ATen/native/mps/operations/Indexing.mm b/aten/src/ATen/native/mps/operations/Indexing.mm index 7c0d7544cf21a..c54f5b12d44af 100644 --- a/aten/src/ATen/native/mps/operations/Indexing.mm +++ b/aten/src/ATen/native/mps/operations/Indexing.mm @@ -1,5 +1,4 @@ // Copyright © 2022 Apple Inc. - #include #include #include @@ -12,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -20,6 +20,7 @@ #include #include #include +#include #ifdef __OBJC__ #include @@ -28,6 +29,137 @@ namespace at { namespace native { +static +bool dispatchIndexSelectKernel(TensorIteratorBase& iter, IntArrayRef index_size, IntArrayRef index_stride) { + using namespace mps; + + if (iter.numel() == 0) + return true; + + const Tensor& inputTensor = iter.tensor(1); + Tensor outputTensor = iter.tensor(0); + id inputBuffer = getMTLBufferStorage(inputTensor); + id outputBuffer = getMTLBufferStorage(outputTensor); + MPSStream* mpsStream = getCurrentMPSStream(); + id device = MPSDevice::getInstance()->device(); + + dispatch_sync(mpsStream->queue(), ^(){ + @autoreleasepool { + NSError* error = nil; + constexpr uint32_t nOffsets = 3; + const int64_t num_indices = index_size.size(); + const uint32_t numThreads = iter.numel(); + const uint32_t nDim = iter.ndim(); + const IntArrayRef& iterShape = iter.shape(); + std::vector iterShapeData(iterShape.size()); + std::vector> strides(nDim); + + for (const auto i: c10::irange(iterShape.size())) { + TORCH_CHECK(i <= UINT32_MAX); + iterShapeData[i] = (uint32_t)(iterShape[i]); + } + + for (const auto i: c10::irange(nDim)) { + for (const auto offset: c10::irange(nOffsets)) { + strides[i][offset] = iter.strides(offset)[i]; + } + } + + MTLSize gridSize = MTLSizeMake(numThreads, 1, 1); + id commandBuffer = mpsStream->commandBuffer(); + id computeEncoder = [commandBuffer computeCommandEncoder]; + id kernelDataOffsetsFunction = MPSDevice::getInstance()->metalIndexingFunction("kernel_index_offsets", nil); + id kernelDataOffsetsPSO = [[device newComputePipelineStateWithFunction: kernelDataOffsetsFunction + error: &error] autorelease]; + id kernelDataOffsets = [[device newBufferWithLength: numThreads * sizeof(simd_uint3) + options: 0] autorelease]; + TORCH_CHECK(kernelDataOffsetsPSO, "Failed to created pipeline state object, error: ", [[error description] UTF8String]); + + [computeEncoder setComputePipelineState:kernelDataOffsetsPSO]; + [computeEncoder setBytes:strides.data() length:sizeof(uint32_t) * nDim * nOffsets atIndex:0]; + [computeEncoder 
setBuffer:kernelDataOffsets offset:0 atIndex:1]; + [computeEncoder setBytes:iterShapeData.data() length:sizeof(uint32_t) * iterShape.size() atIndex:2]; + [computeEncoder setBytes:&nDim length:sizeof(uint32_t) atIndex:3]; + [computeEncoder setBytes:&nOffsets length:sizeof(uint32_t) atIndex:4]; + + NSUInteger kernelOffsetsTGSize = kernelDataOffsetsPSO.maxTotalThreadsPerThreadgroup; + if (kernelOffsetsTGSize > numThreads) + kernelOffsetsTGSize = numThreads; + + MTLSize kernelOffsetsThreadGroupSize = MTLSizeMake(kernelOffsetsTGSize, 1, 1); + [computeEncoder dispatchThreads: gridSize + threadsPerThreadgroup: kernelOffsetsThreadGroupSize]; + + MTLFunctionConstantValues* constantValues = [[MTLFunctionConstantValues new] autorelease]; + [constantValues setConstantValue: &num_indices type:MTLDataTypeUInt atIndex:0]; + + std::string indexFunction = getIndexFunctionName(inputTensor.scalar_type(), true, false); + id indexKernelFunction = MPSDevice::getInstance()->metalIndexingFunction(indexFunction, constantValues); + id argumentEncoder = [[indexKernelFunction newArgumentEncoderWithBufferIndex:0] autorelease]; + NSUInteger argumentBufferLength = argumentEncoder.encodedLength; + id indexAB = [[device newBufferWithLength:argumentBufferLength options:0] autorelease]; + [argumentEncoder setArgumentBuffer:indexAB offset:0]; + + for (uint32_t idx = 0; idx < num_indices; idx++) { + const Tensor& indexTensor = iter.tensor(idx+2); + [argumentEncoder setBuffer: getMTLBufferStorage(indexTensor) + offset: indexTensor.storage_offset() * indexTensor.element_size() + atIndex: idx]; + TORCH_CHECK(indexTensor.scalar_type() == ScalarType::Long, "index(): Expected dtype int64 for Index"); + } + + // FIXME: PSO needs to be cached + id indexSelectPSO = [[device newComputePipelineStateWithFunction: indexKernelFunction + error: &error] autorelease]; + TORCH_CHECK(indexSelectPSO, "Failed to created pipeline state object, error: ", [[error description] UTF8String]); + + for (uint32_t idx = 0; idx < num_indices; idx++) { + const Tensor& indexTensor = iter.tensor(idx+2); + [computeEncoder useResource:getMTLBufferStorage(indexTensor) usage:MTLResourceUsageRead]; + } + + [computeEncoder setComputePipelineState:indexSelectPSO]; + [computeEncoder setBuffer:indexAB offset:0 atIndex:0]; + [computeEncoder setBytes:index_size.data() length:sizeof(index_size[0]) * index_size.size() atIndex:1]; + [computeEncoder setBytes:index_stride.data() length:sizeof(index_stride[0]) * index_stride.size() atIndex:2]; + [computeEncoder setBuffer:kernelDataOffsets offset:0 atIndex:3]; + [computeEncoder setBuffer:inputBuffer offset:inputTensor.storage_offset() * inputTensor.element_size() atIndex:4]; + [computeEncoder setBuffer:outputBuffer offset:outputTensor.storage_offset() * outputTensor.element_size() atIndex:5]; + + NSUInteger tgSize = indexSelectPSO.maxTotalThreadsPerThreadgroup; + if (tgSize > numThreads) + tgSize = numThreads; + + MTLSize threadGroupSize = MTLSizeMake(tgSize, 1, 1); + [computeEncoder dispatchThreads: gridSize + threadsPerThreadgroup: threadGroupSize]; + + [computeEncoder endEncoding]; + mpsStream->commit(true); + } + }); + + return true; +} + +void index_kernel_mps(TensorIteratorBase& iter, IntArrayRef index_size, IntArrayRef index_stride) { + using namespace mps; + + @autoreleasepool { + int64_t num_indices = index_size.size(); + + AT_ASSERT(num_indices == index_stride.size()); + AT_ASSERT(num_indices == iter.ntensors() - 2); + const Tensor& inputTensor = iter.tensor(1); + + 
TORCH_CHECK(c10::isIntegralType(inputTensor.scalar_type(), /*includesBool=*/true) || + inputTensor.scalar_type() == ScalarType::Float || + inputTensor.scalar_type() == ScalarType::Half, + getMPSTypeString(inputTensor.scalar_type()) + std::string(" not supported for index.Tensor_out")); + dispatchIndexSelectKernel(iter, index_size, index_stride); + } +} + Tensor flip_mps(const Tensor& self, IntArrayRef dims) { using namespace mps; @@ -161,11 +293,6 @@ Tensor flip_mps(const Tensor& self, IntArrayRef dims) { MPSGraphTensor* indexTensor = mpsGraphRankedPlaceHolder(mpsGraph, index); MPSGraphTensor* sourceTensor = mpsGraphRankedPlaceHolder(mpsGraph, source); MPSGraphTensor* alphaTensor = mpsGraphScalarPlaceHolder(mpsGraph, alpha_f); - MPSGraphTensor* inputSlice = [mpsGraph gatherWithUpdatesTensor:inputTensor - indicesTensor:indexTensor - axis:dim - batchDimensions:0 - name:nil]; MPSGraphTensor* alphaSourceSlice = [mpsGraph multiplicationWithPrimaryTensor:sourceTensor secondaryTensor:alphaTensor name:nil]; @@ -499,5 +626,7 @@ Tensor embedding_dense_backward_mps( return masked_fill__mps(self, mask, value.item()); } -} -} +REGISTER_DISPATCH(index_stub, &index_kernel_mps); + +} // native +} // at diff --git a/aten/src/ATen/native/mps/operations/Linear.mm b/aten/src/ATen/native/mps/operations/Linear.mm index a6710ea5fc2a5..b3f776d237514 100644 --- a/aten/src/ATen/native/mps/operations/Linear.mm +++ b/aten/src/ATen/native/mps/operations/Linear.mm @@ -46,6 +46,10 @@ Tensor _mps_linear( TORCH_CHECK(output.is_mps()); + if(output.numel() == 0) { + return output; + } + MPSStream *stream = getCurrentMPSStream(); struct CachedGraph : public MPSCachedGraph @@ -65,7 +69,6 @@ Tensor _mps_linear( MPSShape* wt_shape = getMPSShape(weight); string wt_key = string([[[wt_shape valueForKey:@"description"] componentsJoinedByString:@","] UTF8String]); - MPSShape* bias_shape = nil; string bias_key = "nobias"; if(is_bias_defined) { bias_key = "bias"; @@ -358,10 +361,10 @@ Tensor _mps_linear_backward_input( const Tensor& weight, std::array output_mask) { Tensor grad_input, grad_weight, grad_bias; if (output_mask[0]) { - grad_input = at::_mps_linear_backward_input(input.sizes(), grad_output, weight); + grad_input = _mps_linear_backward_input(input.sizes(), grad_output, weight); } if (output_mask[1] || output_mask[2]) { - std::tie(grad_weight, grad_bias) = at::_mps_linear_backward_weights(grad_output, input, weight, output_mask[2]); + std::tie(grad_weight, grad_bias) = _mps_linear_backward_weights(grad_output, input, weight, output_mask[2]); } return std::tuple{grad_input, grad_weight, grad_bias}; } diff --git a/aten/src/ATen/native/mps/operations/LinearAlgebra.mm b/aten/src/ATen/native/mps/operations/LinearAlgebra.mm index 8b69c65c17fae..31c8c88248d6a 100644 --- a/aten/src/ATen/native/mps/operations/LinearAlgebra.mm +++ b/aten/src/ATen/native/mps/operations/LinearAlgebra.mm @@ -125,17 +125,11 @@ void prepare_matrices_for_broadcasting( MPSStream* stream = getCurrentMPSStream(); - bool transpose_mat1 = false; - bool transpose_mat2 = false; - - prepare_matrices_for_broadcasting(NULL, self, other, NULL, NULL, transpose_mat1, transpose_mat2); - mps::MPSGraphCache *cache_ = mps::MPSGraphCache::getInstance(); @autoreleasepool { - string key = "mm_out_mps_impl" + getTensorsStringKey({self, other}) - + ":" + to_string(transpose_mat1) + ":" + to_string(transpose_mat2); + string key = "mm_out_mps_impl" + getTensorsStringKey({self, other}); CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); if(!cachedGraph) { @@ -147,31 
+141,25 @@ void prepare_matrices_for_broadcasting( MPSGraph *mpsGraph = mps::make_mps_graph(); newCachedGraph = new CachedGraph(mpsGraph); - MPSGraphTensor *selfTensor = mps::mpsGraphRankedPlaceHolder(mpsGraph, self); - MPSGraphTensor *otherTensor = mps::mpsGraphRankedPlaceHolder(mpsGraph, other); + MPSGraphTensor *selfTensor = nil; + MPSGraphTensor *otherTensor = nil; + MPSGraphTensor *outputTensor = nil; - MPSGraphTensor* t1 = nil; - MPSGraphTensor* t2 = nil; + if(self.numel() == 0 || other.numel() == 0) { - if(transpose_mat1) - t1 = [mpsGraph transposeTensor:selfTensor - dimension:-1 - withDimension:-2 - name:nil]; - else - t1 = selfTensor; + outputTensor = [mpsGraph constantWithScalar:0. + shape:getMPSShape(output_sizes) + dataType:getMPSDataType(output.scalar_type())]; - if(transpose_mat2) - t2 = [mpsGraph transposeTensor:otherTensor - dimension:-1 - withDimension:-2 - name:nil]; - else - t2 = otherTensor; + } + else { - MPSGraphTensor* outputTensor = [mpsGraph matrixMultiplicationWithPrimaryTensor:t1 - secondaryTensor:t2 - name:nil]; + selfTensor = mps::mpsGraphRankedPlaceHolder(mpsGraph, self); + otherTensor = mps::mpsGraphRankedPlaceHolder(mpsGraph, other); + outputTensor = [mpsGraph matrixMultiplicationWithPrimaryTensor:selfTensor + secondaryTensor:otherTensor + name:nil]; + } newCachedGraph->selfTensor_ = selfTensor; newCachedGraph->otherTensor_ = otherTensor; @@ -181,14 +169,21 @@ void prepare_matrices_for_broadcasting( }); cachedGraph = static_cast(tmpCachedGraph); } - Placeholder selfPlaceholder = Placeholder(cachedGraph->selfTensor_, self); - Placeholder otherPlaceholder = Placeholder(cachedGraph->otherTensor_, other); + Placeholder selfPlaceholder = Placeholder(); + Placeholder otherPlaceholder = Placeholder(); + if(!(self.numel() == 0 || other.numel() == 0)) { + selfPlaceholder = Placeholder(cachedGraph->selfTensor_, self); + otherPlaceholder = Placeholder(cachedGraph->otherTensor_, other); + } Placeholder outputPlaceholder = Placeholder(cachedGraph->outputTensor_, output); - NSDictionary* feeds = @{ - selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(), - otherPlaceholder.getMPSGraphTensor() : otherPlaceholder.getMPSGraphTensorData() - }; + NSDictionary* feeds = nil; + + if(!(self.numel() == 0 || other.numel() == 0)) + feeds = @{ + selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(), + otherPlaceholder.getMPSGraphTensor() : otherPlaceholder.getMPSGraphTensorData() + }; NSDictionary* results = @{ outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() @@ -246,8 +241,6 @@ void prepare_matrices_for_broadcasting( MPSStream* stream = getCurrentMPSStream(); - MPSGraph* mpsGraph = make_mps_graph(); - bool transpose_mat1_times_mat2 = false; bool transpose_mat1 = false; bool transpose_mat2 = false; diff --git a/aten/src/ATen/native/mps/operations/LossOps.mm b/aten/src/ATen/native/mps/operations/LossOps.mm index 454a9512c23ab..cc112265a3a87 100644 --- a/aten/src/ATen/native/mps/operations/LossOps.mm +++ b/aten/src/ATen/native/mps/operations/LossOps.mm @@ -766,10 +766,6 @@ void smooth_l1_loss_impl( MPSGraphTensor *targetTensor = mpsGraphUnrankedPlaceHolder(mpsGraph, getMPSDataType(target.scalar_type())); // Setup tensors - MPSGraphTensor *mpsGraphZeroTensor = [mpsGraph constantWithScalar: 0.0 - dataType: inputTensor.dataType]; - MPSGraphTensor *mpsGraphOneTensor = [mpsGraph constantWithScalar: 1.0 - dataType: inputTensor.dataType]; MPSGraphTensor *mpsGraphHalfTensor = [mpsGraph constantWithScalar: 0.5 
dataType: inputTensor.dataType]; MPSGraphTensor *betaTensor = [mpsGraph constantWithScalar: beta @@ -1067,8 +1063,6 @@ void smooth_l1_loss_backward_template( }; MPSGraphCache* cache_ = MPSGraphCache::getInstance(); - MPSStream* stream = getCurrentMPSStream(); - @autoreleasepool { string key = op_name + ":" + reductionToString(reduction) + ":" + std::to_string(delta) + ":" + getTensorsStringKey({input, target}); CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); diff --git a/aten/src/ATen/native/mps/operations/Pad.mm b/aten/src/ATen/native/mps/operations/Pad.mm new file mode 100644 index 0000000000000..25cccfd6f4424 --- /dev/null +++ b/aten/src/ATen/native/mps/operations/Pad.mm @@ -0,0 +1,304 @@ +// Copyright © 2022 Apple Inc. + +#include +#include + +namespace at { +namespace native { +namespace mps { + +// Pad operations (1D/2D/3D forward and backward) +Tensor& pad_out_template(Tensor &output, const Tensor &input_, IntArrayRef padding, + const c10::optional& grad_output_opt, + MPSGraphPaddingMode mode, double constantValue, const string op_name) +{ + const int padding_size = (int) padding.size(); + const int padding_dim = padding_size / 2; // either 1D, 2D, or 3D + + TORCH_CHECK(padding_size == 2 || padding_size == 4 || padding_size == 6, + "invalid padding argument of size ", padding_size); + + const Tensor& grad_output_ = *(at::borrow_from_optional_tensor(grad_output_opt)); + const bool is_backward_pass = grad_output_.defined(); + + int64_t nbatch = 1; + int64_t ndims = input_.ndimension(); + // number of input dims with ConstantPad could be less than 2 + int dim_w = ndims > 1 ? padding_dim : 0; + int dim_h = padding_dim - 1; + int dim_d = padding_dim - 2; + int dim_slices = 0; + + if (!is_backward_pass && ndims > 1) { + bool valid_dims = input_.size(1) != 0 && input_.size(padding_dim) != 0; + TORCH_CHECK((ndims == 1 + padding_dim && valid_dims) || + (ndims == 2 + padding_dim && valid_dims && input_.size(1 + padding_dim) != 0), + "3D or 4D (batch mode) tensor expected for input, but got: ", input_); + } + + if (ndims == 2 + padding_dim) { + nbatch = input_.size(0); + dim_w++; + dim_h++; + dim_d++; + dim_slices++; + } + + int64_t pad_l = padding[0]; + int64_t pad_r = padding[1]; + int64_t pad_t = padding_dim > 1 ? padding[2] : 0; + int64_t pad_b = padding_dim > 1 ? padding[3] : 0; + int64_t pad_front = padding_dim > 2 ? padding[4] : 0; + int64_t pad_back = padding_dim > 2 ? padding[5] : 0; + + int64_t nplane = input_.size(dim_slices); + int64_t input_w = input_.size(dim_w); + int64_t output_w = input_w + pad_l + pad_r; + int64_t input_h = padding_dim > 1 ? input_.size(dim_h) : 0; + int64_t output_h = padding_dim > 1 ? input_h + pad_t + pad_b : 0; + int64_t input_d = padding_dim > 2 ? input_.size(dim_d) : 0; + int64_t output_d = padding_dim > 2 ? input_d + pad_front + pad_back : 0; + + Tensor grad_output, input = input_; + + if (!is_backward_pass) { + TORCH_CHECK(pad_l < input_w && pad_r < input_w, + "Argument #4: Padding size should be less than the corresponding " + "input dimension, but got: padding (", pad_l, ", ", pad_r, + ") at dimension ", dim_w, " of input ", ndims); + + if (padding_dim > 1) { + TORCH_CHECK(pad_t < input_h && pad_b < input_h, + "Argument #6: Padding size should be less than the corresponding " + "input dimension, but got: padding (", pad_t, ", ", pad_b, + ") at dimension ", dim_h, " of input ", ndims); + } + TORCH_CHECK(output_w >= 1 || output_h >= padding_dim - 1, + "input (H: ", input_h, ", W: ", input_w, ") is too small. 
Calculated " + "output H: ", output_h, " W: ", output_w); + + if (ndims == 1 + padding_dim) { + if (padding_dim == 3) + output.resize_({nplane, output_d, output_h, output_w}); + else if (padding_dim == 2) + output.resize_({nplane, output_h, output_w}); + else + output.resize_({nplane, output_w}); + } else { + if (padding_dim == 3) + output.resize_({nbatch, nplane, output_d, output_h, output_w}); + else if (padding_dim == 2) + output.resize_({nbatch, nplane, output_h, output_w}); + else if (ndims > 1) + output.resize_({nbatch, nplane, output_w}); + else + output.resize_({output_w}); + } + if (output.numel() == 0 || input_.numel() == 0) + return output; + input = input_.contiguous(); + } else { + TORCH_CHECK(output_w == grad_output_.size(dim_w), + "gradOutput width unexpected. Expected: ", output_w, ", Got: ", grad_output_.size(dim_w)); + if (padding_dim > 1) { + TORCH_CHECK(output_h == grad_output_.size(dim_h), + "gradOutput height unexpected. Expected: ", output_h, ", Got: ", grad_output_.size(dim_h)); + } + grad_output = grad_output_.contiguous(); + } + + const int64_t input_dim = input.dim(); + MPSShape *leftPadding = nullptr, *rightPadding = nullptr; + if (padding_dim == 3) { + leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_front), @(pad_t), @(pad_l) } count:input_dim]; + rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_back), @(pad_b), @(pad_r) } count:input_dim]; + } else if (padding_dim == 2) { + leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_t), @(pad_l) } count:input_dim]; + rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_b), @(pad_r) } count:input_dim]; + } else if (padding_dim == 1) { + if (input_dim > 1) { + leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_l) } count:input_dim]; + rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_r) } count:input_dim]; + } else { + leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(pad_l) } count:input_dim]; + rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(pad_r) } count:input_dim]; + } + } + + struct CachedGraph : public MPSCachedGraph { + CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) { } + MPSGraphTensor *inputTensor = nil, *outputTensor = nil; + MPSGraphTensor *gradOutputTensor = nil; + }; + MPSGraphCache* cache_ = MPSGraphCache::getInstance(); + + @autoreleasepool { + string key = op_name + getTensorsStringKey({input, grad_output}) + + ":L" + to_string(pad_l) + ":R" + to_string(pad_r) + + ":T" + to_string(pad_t) + ":B" + to_string(pad_b) + + ":F" + to_string(pad_front) + ":K" + to_string(pad_back); + + CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + if(!cachedGraph) { + cachedGraph = static_cast(cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { + CachedGraph *newCachedGraph = nil; + @autoreleasepool { + MPSGraph* mpsGraph = make_mps_graph(); + newCachedGraph = new CachedGraph(mpsGraph); + newCachedGraph->inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input); + if (!is_backward_pass) { + newCachedGraph->outputTensor = [mpsGraph padTensor:newCachedGraph->inputTensor + withPaddingMode:mode + leftPadding:leftPadding + rightPadding:rightPadding + constantValue:constantValue + name:nil]; + } else { + newCachedGraph->gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output); + newCachedGraph->outputTensor = [mpsGraph padGradientWithIncomingGradientTensor:newCachedGraph->gradOutputTensor + 
sourceTensor:newCachedGraph->inputTensor + paddingMode:mode + leftPadding:leftPadding + rightPadding:rightPadding + name:nil]; + } + } + return newCachedGraph; + })); + } + Placeholder inputPlaceholder = Placeholder(cachedGraph->inputTensor, input); + Placeholder outputPlaceholder = Placeholder(cachedGraph->outputTensor, output); + + NSMutableDictionary *feeds = [[NSMutableDictionary new] autorelease]; + feeds[inputPlaceholder.getMPSGraphTensor()] = inputPlaceholder.getMPSGraphTensorData(); + if (is_backward_pass) { + Placeholder gradOutputPlaceholder = Placeholder(cachedGraph->gradOutputTensor, grad_output); + feeds[gradOutputPlaceholder.getMPSGraphTensor()] = gradOutputPlaceholder.getMPSGraphTensorData(); + } + NSDictionary* results = @{ + outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() + }; + runMPSGraph(getCurrentMPSStream(), cachedGraph->graph(), feeds, results); + } + return output; +} +} // namespace mps + +// 1D Reflection and Replication Padding +TORCH_IMPL_FUNC(reflection_pad1d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeReflect, 0.0, "reflection_pad1d_out_mps"); +} + +TORCH_IMPL_FUNC(reflection_pad1d_backward_out_mps) +(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, + MPSGraphPaddingModeReflect, 0.0, "reflection_pad1d_backward_out_mps"); +} + +TORCH_IMPL_FUNC(replication_pad1d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad1d_out_mps"); +} + +TORCH_IMPL_FUNC(replication_pad1d_backward_out_mps) +(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, + MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad1d_backward_out_mps"); +} + +// 2D Reflection and Replication Padding +Tensor& reflection_pad2d_out_mps(const Tensor& input, IntArrayRef padding, Tensor& output) +{ + return mps::pad_out_template(output, input, padding, c10::nullopt, MPSGraphPaddingModeReflect, 0.0, __func__); +} + +Tensor reflection_pad2d_mps(const Tensor& input, IntArrayRef padding) +{ + Tensor output = at::empty({0}, input.options()); + return mps::pad_out_template(output, input, padding, c10::nullopt, MPSGraphPaddingModeReflect, 0.0, __func__); +} + +Tensor& reflection_pad2d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeReflect, 0.0, __func__); +} + +Tensor reflection_pad2d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) +{ + auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeReflect, 0.0, __func__); +} + +TORCH_IMPL_FUNC(replication_pad2d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeClampToEdge, 0.0, 
"replication_pad2d_out_mps"); +} + +Tensor& replication_pad2d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); +} + +Tensor replication_pad2d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) +{ + auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); +} + +// 3D Reflection and Replication Padding +TORCH_IMPL_FUNC(reflection_pad3d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeReflect, 0.0, "reflection_pad3d_out_mps"); +} + +TORCH_IMPL_FUNC(reflection_pad3d_backward_out_mps) +(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, + MPSGraphPaddingModeReflect, 0.0, "reflection_pad3d_backward_out_mps"); +} + +TORCH_IMPL_FUNC(replication_pad3d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad3d_out_mps"); +} + +Tensor& replication_pad3d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); +} + +Tensor replication_pad3d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) +{ + auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); +} + +// backward pass is exlicitly handled in autograd by negating the "pad" argument +Tensor constant_pad_nd_mps(const Tensor& self, IntArrayRef pad, const Scalar& value) +{ + Tensor output = at::empty({0}, self.options()); + return mps::pad_out_template(output, self, pad, c10::nullopt, MPSGraphPaddingModeConstant, value.toDouble(), __func__); +} + +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/mps/operations/PointwiseOps.mm b/aten/src/ATen/native/mps/operations/PointwiseOps.mm index 66427c73e0c75..261749bd269f6 100644 --- a/aten/src/ATen/native/mps/operations/PointwiseOps.mm +++ b/aten/src/ATen/native/mps/operations/PointwiseOps.mm @@ -18,6 +18,11 @@ if (&output != &self) { output.resize_(output.sizes()); } + + if(output.numel() == 0) { + return output; + } + MPSStream* mpsStream = getCurrentMPSStream(); struct CachedGraph : public MPSCachedGraph diff --git a/aten/src/ATen/native/mps/operations/Pooling.mm b/aten/src/ATen/native/mps/operations/Pooling.mm index adf10b16fbfa8..1df24e073239e 100644 --- a/aten/src/ATen/native/mps/operations/Pooling.mm +++ b/aten/src/ATen/native/mps/operations/Pooling.mm @@ -104,7 +104,6 @@ Tensor _mps_max_pool2d( namespace native_mps = at::native::mps; using CachedGraph = native_mps::MPSUnaryCachedGraph; - CheckedFrom c = "mps_max_pool2d"; native_mps::MPSGraphCache* cache_ = 
native_mps::MPSGraphCache::getInstance(); @@ -241,7 +240,6 @@ Tensor mps_max_pool2d_backward( } namespace native_mps = at::native::mps; - CheckedFrom c = "mps_max_pool2d_backward"; // Derive from MPSCachedGraph struct CachedGraph : public native_mps::MPSCachedGraph @@ -383,7 +381,6 @@ Tensor mps_max_pool2d_backward( } /* sizes */ - const int64_t nbatch = input_t.ndimension() == 4 ? input_t.size(-4) : 1; const int64_t nInputPlane = input_t.size(-3); const int64_t inputHeight = input_t.size(-2); const int64_t inputWidth = input_t.size(-1); @@ -399,7 +396,6 @@ Tensor mps_max_pool2d_backward( outputHeight, outputWidth, memory_format); namespace native_mps = at::native::mps; - CheckedFrom c = "max_pool2d_with_indices_out_mps"; // Derive from MPSCachedGraph struct CachedGraph : public native_mps::MPSCachedGraph @@ -541,7 +537,6 @@ Tensor mps_max_pool2d_backward( } namespace native_mps = at::native::mps; - CheckedFrom c = "max_pool2d_with_indices_backward_out_mps"; // Derive from MPSCachedGraph struct CachedGraph : public native_mps::MPSCachedGraph @@ -655,13 +650,7 @@ Tensor mps_max_pool2d_backward( const int padW = safe_downcast(padW_); /* sizes */ - const int64_t nbatch = input_.ndimension() == 4 ? input_.size(-4) : 1; - const int64_t nInputPlane = input_.size(-3); - const int64_t inputHeight = input_.size(-2); - const int64_t inputWidth = input_.size(-1); - int64_t outputWidth = pooling_output_shape(inputWidth, kW, padW, dW, 1, ceil_mode); - int64_t outputHeight = pooling_output_shape(inputHeight, kH, padH, dH, 1, ceil_mode); const auto memory_format = input_.suggest_memory_format(); Tensor input = input_.contiguous(memory_format); @@ -778,8 +767,6 @@ Tensor mps_max_pool2d_backward( const Tensor input = input_.contiguous(memory_format); const Tensor gradOutput = gradOutput_.contiguous(memory_format); - const int64_t nbatch = input.ndimension() == 4 ? input.size(-4) : 1; - const int64_t nInputPlane = input.size(-3); const int64_t inputHeight = input.size(-2); const int64_t inputWidth = input.size(-1); @@ -791,11 +778,8 @@ Tensor mps_max_pool2d_backward( if (count == 0) { return; } - bool use_divisor = divisor_override.has_value(); - const auto divisor_override_value = use_divisor ? 
divisor_override.value() : 0; namespace native_mps = at::native::mps; - CheckedFrom c = "avg_pool2d_backward_out_mps"; // Derive from MPSCachedGraph struct CachedGraph : public native_mps::MPSCachedGraph diff --git a/aten/src/ATen/native/mps/operations/ReduceOps.mm b/aten/src/ATen/native/mps/operations/ReduceOps.mm index 67aeae4ca3cbe..d6e510a06e322 100644 --- a/aten/src/ATen/native/mps/operations/ReduceOps.mm +++ b/aten/src/ATen/native/mps/operations/ReduceOps.mm @@ -407,7 +407,6 @@ Tensor count_nonzero_mps(const Tensor& self, IntArrayRef dims){ "norm_out_mps: reduction dim must be in the range of input shape") } namespace native_mps = at::native::mps; - CheckedFrom c = "norm_out_mps"; using CachedGraph = native_mps::MPSUnaryCachedGraph; diff --git a/aten/src/ATen/native/mps/operations/Repeat.mm b/aten/src/ATen/native/mps/operations/Repeat.mm index b9c465145ffeb..53bcddf405cc6 100644 --- a/aten/src/ATen/native/mps/operations/Repeat.mm +++ b/aten/src/ATen/native/mps/operations/Repeat.mm @@ -36,8 +36,8 @@ Tensor permute_mps(const Tensor& self, IntArrayRef dims) { return self.as_strided(newSizes, newStrides); } -void set_apparent_shapes(NSMutableArray * input_shape, - NSMutableArray * &apparent_input_shape, +void set_apparent_shapes(NSArray * input_shape, + NSArray * &apparent_input_shape, int64_t num_input_dims, IntArrayRef repeats, NSMutableArray * &repeats_shape, @@ -66,13 +66,14 @@ void set_apparent_shapes(NSMutableArray * input_shape, } // num_repeat_dims > num_input_dims else { - apparent_input_shape = [NSMutableArray arrayWithCapacity:num_repeat_dims]; + auto rc = [NSMutableArray arrayWithCapacity:num_repeat_dims]; for(int i = 0; i < num_repeat_dims - num_input_dims; i++) - apparent_input_shape[i] = @1; + rc[i] = @1; for(int i = num_repeat_dims - num_input_dims; i < num_repeat_dims; i++) - apparent_input_shape[i] = input_shape[i + num_input_dims - num_repeat_dims]; + rc[i] = input_shape[i + num_input_dims - num_repeat_dims]; + apparent_input_shape = rc; } } @@ -92,7 +93,7 @@ Tensor repeat_mps(const Tensor& self, IntArrayRef repeats) { MPSGraphCache* cache_ = MPSGraphCache::getInstance(); - NSMutableArray *apparent_input_shape = nil; + NSArray *apparent_input_shape = nil; NSMutableArray *repeats_shape = nil; auto input_shape = getMPSShape(self); diff --git a/aten/src/ATen/native/mps/operations/RnnOps.mm b/aten/src/ATen/native/mps/operations/RnnOps.mm index 0dd1bd6b47a21..f15e842b54b25 100644 --- a/aten/src/ATen/native/mps/operations/RnnOps.mm +++ b/aten/src/ATen/native/mps/operations/RnnOps.mm @@ -52,7 +52,6 @@ MPSGraphCache* cache_ = MPSGraphCache::getInstance(); MPSStream* stream = getCurrentMPSStream(); - int timesteps = (batch_first ? 
input.size(1) : input.size(0)); @autoreleasepool { string key = "lstm_" + getTensorsStringKey({input, hx[0], hx[1]}) + getMPSTypeString(input.scalar_type()) + "_num_layers_" + std::to_string(num_layers); @@ -82,7 +81,6 @@ opDesc.bidirectional = bidirectional; opDesc.produceCell = true; - MPSShape* inputShape = getMPSShape(input); MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(input.scalar_type()), getMPSShape(input)); MPSGraphTensor* stateTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(input.scalar_type()), getMPSShape(hx[0])); MPSGraphTensor* cellStateTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(input.scalar_type()), getMPSShape(hx[1])); @@ -332,7 +330,6 @@ NSMutableArray* gradRecWeightsArray = [[NSMutableArray alloc] initWithCapacity:num_layers]; NSMutableArray* gradWeightsArray = [[NSMutableArray alloc] initWithCapacity:num_layers]; NSMutableArray* gradBiasArray = [[NSMutableArray alloc] initWithCapacity:num_layers]; - NSMutableArray* gradRecBiasArray = [[NSMutableArray alloc] initWithCapacity:num_layers]; NSMutableArray* gradStateArray = [[NSMutableArray alloc] initWithCapacity:num_layers]; NSMutableArray* gradCellStateArray = [[NSMutableArray alloc] initWithCapacity:num_layers]; diff --git a/aten/src/ATen/native/mps/operations/ScatterGather.mm b/aten/src/ATen/native/mps/operations/ScatterGather.mm index a8d73d5fc42a5..c4943d1242d96 100644 --- a/aten/src/ATen/native/mps/operations/ScatterGather.mm +++ b/aten/src/ATen/native/mps/operations/ScatterGather.mm @@ -9,9 +9,7 @@ #include #include -#ifdef __OBJC__ #include -#endif namespace at { namespace native { @@ -120,10 +118,10 @@ toType:getMPSDataType(ScalarType::Int) name:(NSString * _Nonnull)nil]; - MPSGraphTensor* outputTensor = [mpsGraph gatherAlongAxisWithUpdatesTensor:getInput - indicesTensor:castIndexTensor - axis:(NSInteger)dim - name:nil]; + MPSGraphTensor* outputTensor = [mpsGraph gatherAlongAxis: (NSInteger) dim + withUpdatesTensor: getInput + indicesTensor: castIndexTensor + name: nil]; newCachedGraph->inputTensor_ = inputTensor; newCachedGraph->indexTensor_ = indexTensor; @@ -279,7 +277,7 @@ getSrc = srcTensor; // Use in case input needs to be smaller to get scatter - NSMutableArray* scatterInputShape = nil; + NSArray* scatterInputShape = nil; // Slice into the input tensor IF NEEDED if(inputNeedSlice) { @@ -287,7 +285,7 @@ NSMutableArray *ends = [NSMutableArray arrayWithCapacity:num_input_dims]; NSMutableArray *strides = [NSMutableArray arrayWithCapacity:num_input_dims]; - scatterInputShape = [NSMutableArray arrayWithCapacity:num_input_dims]; + auto rc = [NSMutableArray arrayWithCapacity:num_input_dims]; for(int i = 0; i < num_input_dims; i++) { // All strides are 1 @@ -296,13 +294,14 @@ starts[i] = @0; if(i != dim) { ends[i] = index_shape[i]; - scatterInputShape[i] = index_shape[i]; + rc[i] = index_shape[i]; } else { ends[i] = input_shape[i]; - scatterInputShape[i] = input_shape[i]; + rc[i] = input_shape[i]; } } + scatterInputShape = rc; getInput = [mpsGraph sliceTensor:inputTensor starts:starts @@ -336,21 +335,21 @@ scatter_mode = MPSGraphScatterModeMin; if(!inputNeedSlice) { - outputTensor = [mpsGraph scatterAlongAxisWithDataTensor:getInput - updatesTensor:getSrc - indicesTensor:castIndexTensor - axis:(NSInteger)dim - mode:scatter_mode - name:nil]; + outputTensor = [mpsGraph scatterAlongAxis: (NSInteger) dim + withDataTensor: getInput + updatesTensor: getSrc + indicesTensor: castIndexTensor + mode: scatter_mode + name: nil]; } else { // Scatter this into the 
input with set mode - MPSGraphTensor* scatterTensor = [mpsGraph scatterAlongAxisWithDataTensor:getInput - updatesTensor:getSrc - indicesTensor:castIndexTensor - axis:(NSInteger)dim - mode:scatter_mode - name:nil]; + MPSGraphTensor* scatterTensor = [mpsGraph scatterAlongAxis: (NSInteger) dim + withDataTensor: getInput + updatesTensor: getSrc + indicesTensor: castIndexTensor + mode: scatter_mode + name: nil]; // Make an array of scatter indices tensors NSMutableArray* indicesTensors = [NSMutableArray arrayWithCapacity:num_input_dims]; @@ -372,9 +371,9 @@ for(int i = 0; i < num_input_dims; i++) { MPSGraphTensor* axisTensor = [mpsGraph constantWithScalar:i dataType:MPSDataTypeInt32]; - MPSGraphTensor* scatter_currentIndexTensor = [mpsGraph getCoordinateValueWithShapeTensor:scatterInputShapeTensor - axisTensor:axisTensor - name:nil]; + MPSGraphTensor* scatter_currentIndexTensor = [mpsGraph coordinateAlongAxisTensor: axisTensor + withShapeTensor: scatterInputShapeTensor + name: nil]; scatter_currentIndexTensor = [mpsGraph reshapeTensor:scatter_currentIndexTensor withShape:@[@-1, @1] name:nil]; diff --git a/aten/src/ATen/native/mps/operations/Shape.mm b/aten/src/ATen/native/mps/operations/Shape.mm index 977f9f1ce3fae..99dfbcecc24a9 100644 --- a/aten/src/ATen/native/mps/operations/Shape.mm +++ b/aten/src/ATen/native/mps/operations/Shape.mm @@ -16,288 +16,6 @@ namespace at { namespace native { -namespace mps { - -// Pad operations (1D/2D/3D forward and backward) -Tensor& pad_out_template(Tensor &output, const Tensor &input_, IntArrayRef padding, - const c10::optional& grad_output_opt, - MPSGraphPaddingMode mode, double constantValue, const string op_name) -{ - const int padding_size = (int) padding.size(); - const int padding_dim = padding_size / 2; // either 1D, 2D, or 3D - - TORCH_CHECK(padding_size == 2 || padding_size == 4 || padding_size == 6, - "invalid padding argument of size ", padding_size); - - const Tensor& grad_output_ = *(at::borrow_from_optional_tensor(grad_output_opt)); - const bool is_backward_pass = grad_output_.defined(); - - int dim_w = padding_dim, dim_h = padding_dim - 1, dim_d = padding_dim - 2, dim_slices = 0; - int64_t nbatch = 1, ndims = input_.ndimension(); - - if (!is_backward_pass) { - bool valid_dims = input_.size(1) != 0 && input_.size(padding_dim) != 0; - TORCH_CHECK((ndims == 1 + padding_dim && valid_dims) || - (ndims == 2 + padding_dim && valid_dims && input_.size(1 + padding_dim) != 0), - "3D or 4D (batch mode) tensor expected for input, but got: ", input_); - } - - if (ndims == 2 + padding_dim) { - nbatch = input_.size(0); - dim_w++; - dim_h++; - dim_d++; - dim_slices++; - } - - int64_t pad_l = padding[0]; - int64_t pad_r = padding[1]; - int64_t pad_t = padding_dim > 1 ? padding[2] : 0; - int64_t pad_b = padding_dim > 1 ? padding[3] : 0; - int64_t pad_front = padding_dim > 2 ? padding[4] : 0; - int64_t pad_back = padding_dim > 2 ? padding[5] : 0; - - int64_t nplane = input_.size(dim_slices); - int64_t input_w = input_.size(dim_w); - int64_t output_w = input_w + pad_l + pad_r; - int64_t input_h = padding_dim > 1 ? input_.size(dim_h) : 0; - int64_t output_h = padding_dim > 1 ? input_h + pad_t + pad_b : 0; - int64_t input_d = padding_dim > 2 ? input_.size(dim_d) : 0; - int64_t output_d = padding_dim > 2 ? 
input_d + pad_front + pad_back : 0; - - Tensor grad_output, input = input_; - - if (!is_backward_pass) { - TORCH_CHECK(pad_l < input_w && pad_r < input_w, - "Argument #4: Padding size should be less than the corresponding " - "input dimension, but got: padding (", pad_l, ", ", pad_r, - ") at dimension ", dim_w, " of input ", ndims); - - if (padding_dim > 1) { - TORCH_CHECK(pad_t < input_h && pad_b < input_h, - "Argument #6: Padding size should be less than the corresponding " - "input dimension, but got: padding (", pad_t, ", ", pad_b, - ") at dimension ", dim_h, " of input ", ndims); - } - TORCH_CHECK(output_w >= 1 || output_h >= padding_dim - 1, - "input (H: ", input_h, ", W: ", input_w, ") is too small. Calculated " - "output H: ", output_h, " W: ", output_w); - - if (ndims == 1 + padding_dim) { - if (padding_dim == 3) - output.resize_({nplane, output_d, output_h, output_w}); - else if (padding_dim == 2) - output.resize_({nplane, output_h, output_w}); - else - output.resize_({nplane, output_w}); - } else { - if (padding_dim == 3) - output.resize_({nbatch, nplane, output_d, output_h, output_w}); - else if (padding_dim == 2) - output.resize_({nbatch, nplane, output_h, output_w}); - else - output.resize_({nbatch, nplane, output_w}); - } - if (output.numel() == 0 || input_.numel() == 0) - return output; - input = input_.contiguous(); - } else { - TORCH_CHECK(output_w == grad_output_.size(dim_w), - "gradOutput width unexpected. Expected: ", output_w, ", Got: ", grad_output_.size(dim_w)); - if (padding_dim > 1) { - TORCH_CHECK(output_h == grad_output_.size(dim_h), - "gradOutput height unexpected. Expected: ", output_h, ", Got: ", grad_output_.size(dim_h)); - } - grad_output = grad_output_.contiguous(); - } - - const int64_t input_dim = input.dim(); - MPSShape *leftPadding = nullptr, *rightPadding = nullptr; - if (padding_dim == 3) { - leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_front), @(pad_t), @(pad_l) } count:input_dim]; - rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_back), @(pad_b), @(pad_r) } count:input_dim]; - } else if (padding_dim == 2) { - leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_t), @(pad_l) } count:input_dim]; - rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_b), @(pad_r) } count:input_dim]; - } else if (padding_dim == 1) { - leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_l) } count:input_dim]; - rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_r) } count:input_dim]; - } - - struct CachedGraph : public MPSCachedGraph { - CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) { } - MPSGraphTensor *inputTensor = nil, *outputTensor = nil; - MPSGraphTensor *gradOutputTensor = nil; - }; - MPSGraphCache* cache_ = MPSGraphCache::getInstance(); - - @autoreleasepool { - string key = op_name + getTensorsStringKey({input, grad_output}) + - ":L" + to_string(pad_l) + ":R" + to_string(pad_r) + - ":T" + to_string(pad_t) + ":B" + to_string(pad_b) + - ":F" + to_string(pad_front) + ":K" + to_string(pad_back); - - CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); - if(!cachedGraph) { - cachedGraph = static_cast(cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { - CachedGraph *newCachedGraph = nil; - @autoreleasepool { - MPSGraph* mpsGraph = make_mps_graph(); - newCachedGraph = new CachedGraph(mpsGraph); - newCachedGraph->inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input); - if 
(!is_backward_pass) { - newCachedGraph->outputTensor = [mpsGraph padTensor:newCachedGraph->inputTensor - withPaddingMode:mode - leftPadding:leftPadding - rightPadding:rightPadding - constantValue:constantValue - name:nil]; - } else { - newCachedGraph->gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output); - newCachedGraph->outputTensor = [mpsGraph padGradientWithIncomingGradientTensor:newCachedGraph->gradOutputTensor - sourceTensor:newCachedGraph->inputTensor - paddingMode:mode - leftPadding:leftPadding - rightPadding:rightPadding - name:nil]; - } - } - return newCachedGraph; - })); - } - Placeholder inputPlaceholder = Placeholder(cachedGraph->inputTensor, input); - Placeholder outputPlaceholder = Placeholder(cachedGraph->outputTensor, output); - - NSMutableDictionary *feeds = [[NSMutableDictionary new] autorelease]; - feeds[inputPlaceholder.getMPSGraphTensor()] = inputPlaceholder.getMPSGraphTensorData(); - if (is_backward_pass) { - Placeholder gradOutputPlaceholder = Placeholder(cachedGraph->gradOutputTensor, grad_output); - feeds[gradOutputPlaceholder.getMPSGraphTensor()] = gradOutputPlaceholder.getMPSGraphTensorData(); - } - NSDictionary* results = @{ - outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() - }; - runMPSGraph(getCurrentMPSStream(), cachedGraph->graph(), feeds, results); - } - return output; -} -} // namespace mps - -// 1D Reflection and Replication Padding -TORCH_IMPL_FUNC(reflection_pad1d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeReflect, 0.0, "reflection_pad1d_out_mps"); -} - -TORCH_IMPL_FUNC(reflection_pad1d_backward_out_mps) -(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, - MPSGraphPaddingModeReflect, 0.0, "reflection_pad1d_backward_out_mps"); -} - -TORCH_IMPL_FUNC(replication_pad1d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad1d_out_mps"); -} - -TORCH_IMPL_FUNC(replication_pad1d_backward_out_mps) -(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, - MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad1d_backward_out_mps"); -} - -// 2D Reflection and Replication Padding -Tensor& reflection_pad2d_out_mps(const Tensor& input, IntArrayRef padding, Tensor& output) -{ - return mps::pad_out_template(output, input, padding, c10::nullopt, MPSGraphPaddingModeReflect, 0.0, __func__); -} - -Tensor reflection_pad2d_mps(const Tensor& input, IntArrayRef padding) -{ - Tensor output = at::empty({0}, input.options()); - return mps::pad_out_template(output, input, padding, c10::nullopt, MPSGraphPaddingModeReflect, 0.0, __func__); -} - -Tensor& reflection_pad2d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeReflect, 0.0, __func__); -} - -Tensor reflection_pad2d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef 
padding) -{ - auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeReflect, 0.0, __func__); -} - -TORCH_IMPL_FUNC(replication_pad2d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad2d_out_mps"); -} - -Tensor& replication_pad2d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); -} - -Tensor replication_pad2d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) -{ - auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); -} - -// 3D Reflection and Replication Padding -TORCH_IMPL_FUNC(reflection_pad3d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeReflect, 0.0, "reflection_pad3d_out_mps"); -} - -TORCH_IMPL_FUNC(reflection_pad3d_backward_out_mps) -(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, - MPSGraphPaddingModeReflect, 0.0, "reflection_pad3d_backward_out_mps"); -} - -TORCH_IMPL_FUNC(replication_pad3d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad3d_out_mps"); -} - -Tensor& replication_pad3d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); -} - -Tensor replication_pad3d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) -{ - auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); -} - -// backward pass is exlicitly handled in autograd by negating the "pad" argument -Tensor constant_pad_nd_mps(const Tensor& self, IntArrayRef pad, const Scalar& value) -{ - Tensor output = at::empty({0}, self.options()); - return mps::pad_out_template(output, self, pad, c10::nullopt, MPSGraphPaddingModeConstant, value.toDouble(), __func__); -} // topk TORCH_IMPL_FUNC(topk_out_mps) @@ -534,7 +252,6 @@ void check_shape_except_dim(const Tensor &first, const Tensor &second, }; const Tensor* notSkippedTensor = NULL; // non-owning reference - int nDims = 0; // Check for type promotion TORCH_CHECK( @@ -570,7 +287,6 @@ void check_shape_except_dim(const Tensor &first, const Tensor &second, continue; } input_tensors.push_back(&t); - nDims = t.dim(); // TODO: Is this OK? 
notSkippedTensor = &t; tensor_idx++; @@ -879,7 +595,6 @@ void upsample_out_mps(const Tensor& input, int64_t output_width = output_size[1]; @autoreleasepool { MPSShape* input_shape = getMPSShape(input); - NSString* ns_shape_key = [[input_shape valueForKey:@"description"] componentsJoinedByString:@","]; string key = string("upsample_2d:") + mps::getMPSShapeString(input_shape) + ":" + getMPSTypeString(input.scalar_type()) + ":h" + to_string(output_height) + ":w" + to_string(output_width) + @@ -992,7 +707,6 @@ void upsample1d_out_mps(const Tensor& input, int64_t out_size = output_size[0]; @autoreleasepool { MPSShape* input_shape = getMPSShape(input); - NSString* ns_shape_key = [[input_shape valueForKey:@"description"] componentsJoinedByString:@","]; string key = string("upsample_1d:") + mps::getMPSShapeString(input_shape) + ":" + getMPSTypeString(input.scalar_type()) + ":size" + to_string(out_size) + diff --git a/aten/src/ATen/native/mps/operations/SoftMax.mm b/aten/src/ATen/native/mps/operations/SoftMax.mm index 4246a37671e99..e96d5ed2481c3 100644 --- a/aten/src/ATen/native/mps/operations/SoftMax.mm +++ b/aten/src/ATen/native/mps/operations/SoftMax.mm @@ -216,8 +216,6 @@ void get_shapes(MPSShape* input_shape_readonly, @autoreleasepool { MPSShape* grad_shape = mps::getMPSShape(grad); - int num_grad_dims = [grad_shape count]; - NSString* ns_shape_key = [[grad_shape valueForKey:@"description"] componentsJoinedByString:@","]; string key = "softmax_backward_mps_out:" + getMPSTypeString(output.scalar_type()) + ":" diff --git a/aten/src/ATen/native/mps/operations/TensorCompare.mm b/aten/src/ATen/native/mps/operations/TensorCompare.mm index a6c267290312b..fb3b93a602f1a 100644 --- a/aten/src/ATen/native/mps/operations/TensorCompare.mm +++ b/aten/src/ATen/native/mps/operations/TensorCompare.mm @@ -245,8 +245,6 @@ void clamp_scalar_out_mps(const Tensor& input_t, @autoreleasepool { - MPSShape* input_shape = getMPSShape(self); - string key = "where_self_out_mps:" + getTensorsStringKey({cond_bool, self, other}); CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); @@ -304,10 +302,6 @@ Tensor where_mps(const Tensor& condition, const Tensor& self, const Tensor& other) { - auto cond_shape = condition.sizes(); - auto self_shape = self.sizes(); - auto other_shape = other.sizes(); - bool cond_zero_shape = (condition.dim() == 0); bool self_zero_shape = (self.dim() == 0); bool other_zero_shape = (other.dim() == 0); diff --git a/aten/src/ATen/native/mps/operations/TriangularOps.mm b/aten/src/ATen/native/mps/operations/TriangularOps.mm index 6a29d080cb6c9..fb6e1c52ba49e 100644 --- a/aten/src/ATen/native/mps/operations/TriangularOps.mm +++ b/aten/src/ATen/native/mps/operations/TriangularOps.mm @@ -9,9 +9,7 @@ #include #include -#ifdef __OBJC__ #include -#endif namespace at { namespace native { @@ -275,9 +273,9 @@ MPSGraphTensor* inputShapeTensor = [mpsGraph constantWithData:[NSData dataWithBytes:shape_data length:sizeof(int)] shape:@[@1] dataType:MPSDataTypeInt32]; - numDiagElementsRange = [mpsGraph getCoordinateValueWithShapeTensor:inputShapeTensor - axisTensor:zeroTensor - name:nil]; + numDiagElementsRange = [mpsGraph coordinateAlongAxisTensor: zeroTensor + withShapeTensor: inputShapeTensor + name: nil]; diagOffset = [mpsGraph constantWithScalar:diagonal dataType:MPSDataTypeInt32]; rowMultiplier = [mpsGraph constantWithScalar:[num_output_cols intValue] @@ -288,9 +286,9 @@ MPSGraphTensor* outputShapeTensor = [mpsGraph constantWithData:[NSData dataWithBytes:shape_data length:sizeof(int)] shape:@[@1] 
dataType:MPSDataTypeInt32]; - numDiagElementsRange = [mpsGraph getCoordinateValueWithShapeTensor:outputShapeTensor - axisTensor:zeroTensor - name:nil]; + numDiagElementsRange = [mpsGraph coordinateAlongAxisTensor: zeroTensor + withShapeTensor: outputShapeTensor + name: nil]; diagOffset = [mpsGraph constantWithScalar:diagonal dataType:MPSDataTypeInt32]; rowMultiplier = [mpsGraph constantWithScalar:[num_input_cols intValue] diff --git a/aten/src/ATen/native/mps/operations/View.mm b/aten/src/ATen/native/mps/operations/View.mm index 4fa614ae6e2c6..a8a55b21d2468 100644 --- a/aten/src/ATen/native/mps/operations/View.mm +++ b/aten/src/ATen/native/mps/operations/View.mm @@ -86,7 +86,8 @@ static MPSGraphTensor* chainViewOperation(ViewCachedGraph* cachedGraph, const IntArrayRef& size, const IntArrayRef& stride, int64_t offset, - const IntArrayRef& base_shape, bool needsScatter) + const IntArrayRef& base_shape, bool needsScatter, + const bool needsBoolCast) { MPSGraph* mpsGraph = cachedGraph->graph(); MPSGraphTensor *outputTensor = nil; @@ -126,7 +127,17 @@ indicesTensor = [mpsGraph additionWithPrimaryTensor: indicesTensor secondaryTensor: cachedGraph->storageOffsetTensor name: nil]; - MPSGraphTensor *reshapedInputTensor = [mpsGraph reshapeTensor: cachedGraph->inputTensor + MPSGraphTensor *inputTensor = cachedGraph->inputTensor; + + // Workaround for bool scatter/gather deficiency + // See https://github.com/pytorch/pytorch/issues/82663 + if (needsBoolCast) { + inputTensor = [mpsGraph castTensor:inputTensor + toType:MPSDataTypeInt8 + name:@"Cast away from bool"]; + } + + MPSGraphTensor *reshapedInputTensor = [mpsGraph reshapeTensor: inputTensor withShape: @[@-1] name: nil]; MPSGraphTensor *reshapedIndicesTensor = [mpsGraph reshapeTensor: indicesTensor @@ -154,6 +165,14 @@ withShapeTensor: shapeTensor name: nil]; } + + // Workaround for bool scatter/gather deficiency + // See https://github.com/pytorch/pytorch/issues/82663 + if (needsBoolCast) { + outputTensor = [mpsGraph castTensor:outputTensor + toType:MPSDataTypeBool + name:@"Cast back to bool"]; + } } return outputTensor; } @@ -205,6 +224,7 @@ if (inputType == MPSDataTypeUInt8) { inputType = MPSDataTypeInt8; } + auto needsBoolCast = inputType == MPSDataTypeBool; // Self is the input tensor we are creating view of newCachedGraph->inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, inputType, getMPSShape(base_shape)); newCachedGraph->storageOffsetTensor = mpsGraphRankedPlaceHolder(mpsGraph, MPSDataTypeInt32, @[@1]); @@ -214,7 +234,7 @@ if (needsScatter) { newCachedGraph->updatesTensor = mpsGraphUnrankedPlaceHolder(mpsGraph, getMPSDataType(self.scalar_type())); } - newCachedGraph->outputTensor = chainViewOperation(newCachedGraph, size, stride, storage_offset, base_shape, needsScatter); + newCachedGraph->outputTensor = chainViewOperation(newCachedGraph, size, stride, storage_offset, base_shape, needsScatter, needsBoolCast); } return newCachedGraph; })); diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml index ab6d38e553d30..848a84a7a55e1 100644 --- a/aten/src/ATen/native/native_functions.yaml +++ b/aten/src/ATen/native/native_functions.yaml @@ -131,6 +131,7 @@ variants: function dispatch: CompositeExplicitAutograd: _new_zeros_with_same_feature_meta + autogen: _new_zeros_with_same_feature_meta.out # This function compares the storage numel of self with that of other, where # storage numel is cumputed as: `other.storage().nbytes() / other.itemsize()`. 
@@ -181,12 +182,14 @@ device_check: NoCheck # log_probs is expected to be on CUDA while targets is expected to be on CPU dispatch: CUDA: _cudnn_ctc_loss + autogen: _cudnn_ctc_loss.out - func: _use_cudnn_rnn_flatten_weight() -> bool - func: _cudnn_rnn_flatten_weight(Tensor[] weight_arr, int weight_stride0, int input_size, int mode, int hidden_size, int proj_size, int num_layers, bool batch_first, bool bidirectional) -> Tensor dispatch: CUDA: _cudnn_rnn_flatten_weight + autogen: _cudnn_rnn_flatten_weight.out - func: _cudnn_rnn(Tensor input, Tensor[] weight, int weight_stride0, Tensor? weight_buf, Tensor hx, Tensor? cx, int mode, int hidden_size, int proj_size, int num_layers, bool batch_first, float dropout, bool train, bool bidirectional, int[] batch_sizes, Tensor? dropout_state) -> (Tensor, Tensor, Tensor, Tensor, Tensor) # rnn_tanh may or may not redispatch to _cudnn_rnn based on algorithm and build. Thus it might hit dispatch or kernel device check. @@ -194,14 +197,17 @@ device_check: NoCheck dispatch: CUDA: _cudnn_rnn + autogen: _cudnn_rnn.out - func: _cudnn_rnn_backward(Tensor input, Tensor[] weight, int weight_stride0, Tensor weight_buf, Tensor hx, Tensor? cx, Tensor output, Tensor? grad_output, Tensor? grad_hy, Tensor? grad_cy, int mode, int hidden_size, int proj_size, int num_layers, bool batch_first, float dropout, bool train, bool bidirectional, int[] batch_sizes, Tensor? dropout_state, Tensor reserve, bool[4] output_mask) -> (Tensor, Tensor, Tensor, Tensor[]) dispatch: CUDA: _cudnn_rnn_backward + autogen: _cudnn_rnn_backward.out - func: _cudnn_init_dropout_state(float dropout, bool train, int dropout_seed, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=False) -> Tensor dispatch: CUDA: _cudnn_init_dropout_state + autogen: _cudnn_init_dropout_state.out - func: _debug_has_internal_overlap(Tensor self) -> int variants: function @@ -211,23 +217,28 @@ dispatch: CUDA: fused_dropout_cuda tags: nondeterministic_seeded + autogen: _fused_dropout.out - func: _masked_scale(Tensor self, Tensor mask, float scale) -> Tensor variants: function dispatch: CUDA: masked_scale_cuda + autogen: _masked_scale.out - func: native_dropout(Tensor input, float p, bool? train) -> (Tensor, Tensor) variants: function dispatch: CPU: native_dropout_cpu CUDA: native_dropout_cuda + NestedTensorCPU, NestedTensorCUDA: native_dropout_nested tags: nondeterministic_seeded + autogen: native_dropout.out - func: native_dropout_backward(Tensor grad_output, Tensor mask, float scale) -> Tensor dispatch: - CPU: native_dropout_backward_cpu + CPU, NestedTensorCPU, NestedTensorCUDA: native_dropout_backward CUDA: native_dropout_backward_cuda + autogen: native_dropout_backward.out - func: _sobol_engine_draw(Tensor quasi, int n, Tensor sobolstate, int dimension, int num_generated, ScalarType? dtype) -> (Tensor, Tensor) @@ -242,27 +253,28 @@ - func: _shape_as_tensor(Tensor self) -> Tensor - func: dropout(Tensor input, float p, bool train) -> Tensor - dispatch: - CompositeImplicitAutograd: dropout - NestedTensorCPU, NestedTensorCUDA: dropout_nested tags: nondeterministic_seeded - func: dropout_(Tensor(a!) self, float p, bool train) -> Tensor(a!) - dispatch: - CompositeImplicitAutograd: dropout_ - NestedTensorCPU, NestedTensorCUDA: dropout_nested_ + tags: nondeterministic_seeded - func: feature_dropout(Tensor input, float p, bool train) -> Tensor + tags: nondeterministic_seeded - func: feature_dropout_(Tensor(a!) self, float p, bool train) -> Tensor(a!) 
+ tags: nondeterministic_seeded - func: alpha_dropout(Tensor input, float p, bool train) -> Tensor + tags: nondeterministic_seeded - func: alpha_dropout_(Tensor(a!) self, float p, bool train) -> Tensor(a!) + tags: nondeterministic_seeded - func: feature_alpha_dropout(Tensor input, float p, bool train) -> Tensor + tags: nondeterministic_seeded - func: feature_alpha_dropout_(Tensor(a!) self, float p, bool train) -> Tensor(a!) + tags: nondeterministic_seeded - func: abs(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -393,6 +405,7 @@ dispatch: CompositeExplicitAutograd: _conj_physical SparseCsrCPU, SparseCsrCUDA: conj_physical_sparse_csr + autogen: _conj_physical.out - func: conj_physical(Tensor self) -> Tensor variants: function, method @@ -567,6 +580,7 @@ variants: function dispatch: CompositeExplicitAutograd: affine_grid_generator + autogen: affine_grid_generator.out - func: affine_grid_generator_backward(Tensor grad, int[] size, bool align_corners) -> Tensor variants: function @@ -594,6 +608,7 @@ - func: allclose(Tensor self, Tensor other, float rtol=1e-05, float atol=1e-08, bool equal_nan=False) -> bool variants: function, method + tags: data_dependent_output dispatch: CompositeExplicitAutograd: allclose @@ -626,15 +641,17 @@ dispatch: CompositeExplicitAutograd: arange -# Note [arange.start_step schema] -# We want `arange.start_step` to be grouped up with `arange.start_out`, -# But this doesn't happen automatically because the step argument -# is defaultable for .start_out but not for .start_step. -# We should probably just make "step" a defaultable param on arange.start, -# and kill arange.start_step. -- func: arange.start_step(Scalar start, Scalar end, Scalar step, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +# This operator should be named `aragne.start_out` if following the naming convention. However that +# name is already taken. Disabled because of CI job failures. +# FIXME: enable this +#- func: arange.start_out_(Scalar start, Scalar end, *, Tensor(a!) out) -> Tensor(a!) +# dispatch: +# CompositeExplicitAutograd: arange_start_out + +- func: arange.start_step(Scalar start, Scalar end, Scalar step=1, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: arange + cpp_no_default_args: ['step'] - func: arange.out(Scalar end, *, Tensor(a!) out) -> Tensor(a!) dispatch: @@ -645,6 +662,7 @@ CPU, Meta: arange_out CUDA: arange_cuda_out MPS: arange_mps_out + cpp_no_default_args: ['step'] # This function is a temporary hack to allow tracing of arange like constructs with dynamic # bounds on arange. Normal arange is not traceable because it does not take any tensor inputs; @@ -888,16 +906,19 @@ - func: bartlett_window(int window_length, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: bartlett_window + autogen: bartlett_window.out - func: bartlett_window.periodic(int window_length, bool periodic, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: bartlett_window + autogen: bartlett_window.periodic_out - func: batch_norm(Tensor input, Tensor? weight, Tensor? bias, Tensor? running_mean, Tensor? running_var, bool training, float momentum, float eps, bool cudnn_enabled) -> Tensor - func: quantized_batch_norm(Tensor input, Tensor? weight, Tensor? 
bias, Tensor mean, Tensor var, float eps, float output_scale, int output_zero_point) -> Tensor dispatch: QuantizedCPU: quantized_batch_norm + autogen: quantized_batch_norm.out - func: _batch_norm_impl_index(Tensor input, Tensor? weight, Tensor? bias, Tensor? running_mean, Tensor? running_var, bool training, float momentum, float eps, bool cudnn_enabled) -> (Tensor, Tensor, Tensor, Tensor, int) @@ -914,6 +935,7 @@ - func: bernoulli.out(Tensor self, *, Generator? generator=None, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: function + tags: nondeterministic_seeded dispatch: CPU, CUDA: bernoulli_out MPS: bernoulli_out_mps @@ -921,6 +943,7 @@ - func: bernoulli_.Tensor(Tensor(a!) self, Tensor p, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: method + tags: nondeterministic_seeded dispatch: CPU, CUDA: bernoulli_ MPS: bernoulli_mps_ @@ -929,6 +952,7 @@ - func: bernoulli_.float(Tensor(a!) self, float p=0.5, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: method + tags: nondeterministic_seeded dispatch: CPU, CUDA: bernoulli_ MPS: bernoulli_mps_ @@ -985,6 +1009,7 @@ variants: function dispatch: CompositeExplicitAutograd: binary_cross_entropy_with_logits + autogen: binary_cross_entropy_with_logits.out - func: bincount(Tensor self, Tensor? weights=None, int minlength=0) -> Tensor variants: function, method @@ -992,6 +1017,7 @@ CPU: _bincount_cpu CUDA: _bincount_cuda tags: dynamic_output_shape + autogen: bincount.out - func: bitwise_not(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -1116,10 +1142,12 @@ - func: blackman_window(int window_length, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: blackman_window + autogen: blackman_window.out - func: blackman_window.periodic(int window_length, bool periodic, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: blackman_window + autogen: blackman_window.periodic_out - func: bmm(Tensor self, Tensor mat2) -> Tensor structured_delegate: bmm.out @@ -1140,10 +1168,6 @@ SparseCUDA: bmm_out_sparse_cuda SparseCsrCUDA: bmm_out_sparse_csr_cuda -- func: _NestedTensor_GeneralizedBMM(Tensor self, Tensor mat2) -> Tensor - dispatch: - NestedTensorCPU, NestedTensorCUDA: _NestedTensor_GeneralizedBMM - - func: broadcast_tensors(Tensor[] tensors) -> Tensor[] device_check: NoCheck device_guard: False @@ -1189,6 +1213,7 @@ variants: function dispatch: CompositeExplicitAutograd: block_diag + autogen: block_diag.out - func: ceil(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -1396,6 +1421,7 @@ dispatch: CompositeExplicitAutograd: constant_pad_nd MPS: constant_pad_nd_mps + autogen: constant_pad_nd.out - func: contiguous(Tensor(a) self, *, MemoryFormat memory_format=contiguous_format) -> Tensor(a) variants: method @@ -1404,22 +1430,27 @@ - func: convolution(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups) -> Tensor dispatch: CompositeExplicitAutograd: convolution + autogen: convolution.out - func: convolution_backward(Tensor grad_output, Tensor input, Tensor weight, int[]? 
bias_sizes, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: CompositeExplicitAutograd, CUDA: convolution_backward + autogen: convolution_backward.out - func: convolution_overrideable(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups) -> Tensor dispatch: CompositeExplicitAutograd: convolution_overrideable + autogen: convolution_overrideable.out - func: convolution_backward_overrideable(Tensor grad_output, Tensor input, Tensor weight, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool[3] output_mask) -> (Tensor grad_input, Tensor grad_weight, Tensor grad_bias) dispatch: CompositeExplicitAutograd: convolution_backward_overrideable + autogen: convolution_backward_overrideable.out - func: _convolution(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool benchmark, bool deterministic, bool cudnn_enabled, bool allow_tf32) -> Tensor dispatch: CompositeExplicitAutograd: _convolution + autogen: _convolution.out - func: _convolution.deprecated(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool benchmark, bool deterministic, bool cudnn_enabled) -> Tensor @@ -1445,6 +1476,7 @@ - func: conv_tbc(Tensor self, Tensor weight, Tensor bias, int pad=0) -> Tensor dispatch: CompositeExplicitAutograd: conv_tbc + autogen: conv_tbc.out - func: conv_tbc_backward(Tensor self, Tensor input, Tensor weight, Tensor bias, int pad) -> (Tensor, Tensor, Tensor) @@ -1472,12 +1504,14 @@ - func: _copy_from(Tensor self, Tensor dst, bool non_blocking=False) -> Tensor dispatch: MPS: _copy_from_mps + autogen: _copy_from.out # We need this to be able to properly copy from a CPU to an XLA tensor with different sizes. # See https://github.com/pytorch/xla/issues/2881 - func: _copy_from_and_resize(Tensor self, Tensor dst) -> Tensor dispatch: MPS: _copy_from_and_resize_mps + autogen: _copy_from_and_resize.out - func: cos(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -1523,11 +1557,13 @@ CPU: count_nonzero_cpu CUDA: count_nonzero_cuda MPS: count_nonzero_mps + autogen: count_nonzero.dim_IntList_out - func: count_nonzero(Tensor self, int? dim=None) -> Tensor variants: function, method dispatch: CompositeExplicitAutograd: count_nonzero + autogen: count_nonzero.out - func: cov(Tensor self, *, int correction=1, Tensor? fweights=None, Tensor? aweights=None) -> Tensor variants: function, method @@ -1538,53 +1574,65 @@ - func: cudnn_affine_grid_generator(Tensor theta, int N, int C, int H, int W) -> Tensor grid dispatch: CUDA: cudnn_affine_grid_generator_forward + autogen: cudnn_affine_grid_generator.out # TODO: Why do I have to call this grad?! - func: cudnn_affine_grid_generator_backward(Tensor grad, int N, int C, int H, int W) -> Tensor grad_theta dispatch: CUDA: cudnn_affine_grid_generator_backward + autogen: cudnn_affine_grid_generator_backward.out - func: cudnn_batch_norm(Tensor input, Tensor weight, Tensor? bias, Tensor? running_mean, Tensor? 
running_var, bool training, float exponential_average_factor, float epsilon) -> (Tensor, Tensor, Tensor, Tensor) dispatch: CUDA: cudnn_batch_norm + autogen: cudnn_batch_norm.out # NB: You can only use this if you used cudnn_batch_norm training=True - func: cudnn_batch_norm_backward(Tensor input, Tensor grad_output, Tensor weight, Tensor? running_mean, Tensor? running_var, Tensor? save_mean, Tensor? save_var, float epsilon, Tensor reserveSpace) -> (Tensor, Tensor, Tensor) dispatch: CUDA: cudnn_batch_norm_backward + autogen: cudnn_batch_norm_backward.out - func: cudnn_convolution(Tensor self, Tensor weight, int[] padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic, bool allow_tf32) -> Tensor dispatch: CUDA: cudnn_convolution + autogen: cudnn_convolution.out - func: cudnn_convolution_transpose(Tensor self, Tensor weight, int[] padding, int[] output_padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic, bool allow_tf32) -> Tensor dispatch: CUDA: cudnn_convolution_transpose + autogen: cudnn_convolution_transpose.out - func: _mps_convolution_transpose(Tensor self, Tensor weight, int[] padding, int[] output_padding, int[] stride, int[] dilation, int groups) -> Tensor dispatch: MPS: _mps_convolution_transpose + autogen: _mps_convolution_transpose.out - func: mps_convolution_transpose_backward(Tensor self, Tensor grad_output, Tensor weight, int[] padding, int[] output_padding, int[] stride, int[] dilation, int groups, bool[2] output_mask) -> (Tensor, Tensor) dispatch: MPS: mps_convolution_transpose_backward + autogen: mps_convolution_transpose_backward.out - func: cudnn_convolution_relu(Tensor self, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, int groups) -> Tensor dispatch: CUDA: cudnn_convolution_relu + autogen: cudnn_convolution_relu.out - func: cudnn_convolution_add_relu(Tensor self, Tensor weight, Tensor z, Scalar? alpha, Tensor? bias, int[] stride, int[] padding, int[] dilation, int groups) -> Tensor dispatch: CUDA: cudnn_convolution_add_relu + autogen: cudnn_convolution_add_relu.out # NB: input is special cased in a way I don't quite understand - func: cudnn_grid_sampler(Tensor self, Tensor grid) -> Tensor output dispatch: CUDA: cudnn_grid_sampler_forward + autogen: cudnn_grid_sampler.out - func: cudnn_grid_sampler_backward(Tensor self, Tensor grid, Tensor grad_output) -> (Tensor grad_self, Tensor grad_grid) dispatch: CUDA: cudnn_grid_sampler_backward + autogen: cudnn_grid_sampler_backward.out - func: cummax(Tensor self, int dim) -> (Tensor values, Tensor indices) device_check: NoCheck # TensorIterator @@ -1707,16 +1755,19 @@ dispatch: CPU: ctc_loss_cpu CUDA: ctc_loss_gpu + autogen: _ctc_loss.out - func: _ctc_loss_backward(Tensor grad, Tensor log_probs, Tensor targets, int[] input_lengths, int[] target_lengths, Tensor neg_log_likelihood, Tensor log_alpha, int blank, bool zero_infinity=False) -> Tensor dispatch: CPU: ctc_loss_backward_cpu CUDA: ctc_loss_backward_gpu + autogen: _ctc_loss_backward.out - func: diag_embed(Tensor self, int offset=0, int dim1=-2, int dim2=-1) -> Tensor variants: function, method dispatch: CompositeExplicitAutograd: diag_embed + autogen: diag_embed.out - func: diagflat(Tensor self, int offset=0) -> Tensor variants: function, method @@ -1739,6 +1790,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: diagonal_backward + autogen: diagonal_backward.out - func: fill_diagonal_(Tensor(a!) self, Scalar fill_value, bool wrap=False) -> Tensor(a!) 
variants: method @@ -1918,6 +1970,7 @@ dispatch: CompositeExplicitAutograd: embedding NestedTensorCPU, NestedTensorCUDA: NestedTensor_embedding + autogen: embedding.out - func: embedding_backward(Tensor grad, Tensor indices, int num_weights, int padding_idx, bool scale_grad_by_freq, bool sparse) -> Tensor @@ -1926,6 +1979,7 @@ CPU: embedding_dense_backward_cpu CUDA: embedding_dense_backward_cuda MPS: embedding_dense_backward_mps + autogen: embedding_dense_backward.out - func: embedding_renorm_(Tensor(a!) self, Tensor indices, float max_norm, float norm_type) -> Tensor(a!) dispatch: @@ -1949,6 +2003,7 @@ dispatch: CPU: _embedding_bag_forward_only_cpu CUDA: _embedding_bag_forward_only_cuda + autogen: _embedding_bag_forward_only.out - func: _rowwise_prune(Tensor weight, Tensor mask, ScalarType compressed_indices_dtype) -> (Tensor, Tensor) @@ -1969,6 +2024,7 @@ dispatch: CPU: _embedding_bag_cpu CUDA: _embedding_bag_cuda + autogen: _embedding_bag.out - func: _embedding_bag_backward(Tensor grad, Tensor indices, Tensor offsets, Tensor offset2bag, Tensor bag_size, Tensor maximum_indices, int num_weights, bool scale_grad_by_freq, int mode, bool sparse, Tensor? per_sample_weights, int padding_idx=-1) -> Tensor @@ -1978,17 +2034,20 @@ dispatch: CPU: _embedding_bag_dense_backward_cpu CUDA: _embedding_bag_dense_backward_cuda + autogen: _embedding_bag_dense_backward.out - func: _embedding_bag_per_sample_weights_backward(Tensor grad, Tensor weight, Tensor indices, Tensor offsets, Tensor offset2bag, int mode, int padding_idx=-1) -> Tensor dispatch: CPU: _embedding_bag_per_sample_weights_backward_cpu CUDA: _embedding_bag_per_sample_weights_backward_cuda + autogen: _embedding_bag_per_sample_weights_backward.out - func: empty.names(int[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: empty + autogen: empty.names_out - func: empty.memory_format(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor dispatch: @@ -2018,6 +2077,7 @@ SparseCPU, SparseCUDA, SparseMeta: empty_symint_sparse SparseCsrCPU, SparseCsrCUDA: empty_symint_sparse_compressed QuantizedCPU, QuantizedCUDA: empty_symint_unknown_quantized + autogen: empty.SymInt_out # We do not make new_empty a composite that calls into new_empty_strided, as the strided version # is significantly more difficult to implement by different backends @@ -2025,16 +2085,19 @@ variants: method dispatch: CompositeExplicitAutograd: new_empty + autogen: new_empty.out - func: new_empty.SymInt(Tensor self, SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor variants: method dispatch: CompositeExplicitAutograd: new_empty_symint + autogen: new_empty.SymInt_out - func: new_empty_strided(Tensor self, int[] size, int[] stride, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor variants: method dispatch: CompositeExplicitAutogradNonFunctional: new_empty_strided + autogen: new_empty_strided.out - func: new_full(Tensor self, int[] size, Scalar fill_value, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor variants: method @@ -2042,6 +2105,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: new_full + autogen: new_full.out - func: new_zeros(Tensor self, int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor variants: method @@ -2049,6 +2113,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: new_zeros + autogen: new_zeros.out - func: new_ones(Tensor self, int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor variants: method @@ -2056,12 +2121,14 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: new_ones + autogen: new_ones.out # other overrides are to provide a more helpful error message that dtype is required - func: _empty_affine_quantized(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, float scale=1, int zero_point=0, MemoryFormat? memory_format=contiguous_format) -> Tensor dispatch: CPU: empty_affine_quantized_other_backends_stub QuantizedCPU, QuantizedCUDA: empty_affine_quantized + autogen: _empty_affine_quantized.out # it's a factory function receiving a tensor argument, thus overriding explicitly # other overrides are to provide a more helpful error message that dtype is required @@ -2070,12 +2137,14 @@ dispatch: CPU: empty_per_channel_affine_quantized_other_backends_stub QuantizedCPU, QuantizedCUDA: empty_per_channel_affine_quantized + autogen: _empty_per_channel_affine_quantized.out - func: resize_(Tensor(a!) self, int[] size, *, MemoryFormat? memory_format=None) -> Tensor(a!) use_const_ref_for_mutable_tensors: True variants: method device_check: NoCheck device_guard: False + tags: inplace_view dispatch: CPU, Meta: resize_ CUDA: resize_cuda_ @@ -2099,6 +2168,7 @@ variants: function dispatch: QuantizedCPU, QuantizedCUDA: empty_quantized + autogen: empty_quantized.out - func: empty.out(int[] size, *, MemoryFormat? memory_format=None, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck @@ -2112,6 +2182,7 @@ QuantizedCPU, QuantizedCUDA: empty_like_quantized SparseCPU, SparseCUDA, SparseMeta: empty_like_sparse_coo SparseCsrCPU, SparseCsrCUDA: empty_like_sparse_csr + autogen: empty_like.out - func: empty_strided(int[] size, int[] stride, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: @@ -2120,6 +2191,7 @@ MPS: empty_strided_mps Meta: empty_strided_meta QuantizedCPU, QuantizedCUDA: empty_strided_unknown_quantized + autogen: empty_strided.out - func: erf(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -2387,6 +2459,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: full + autogen: full.names_out - func: full(int[] size, Scalar fill_value, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: @@ -2401,10 +2474,12 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: full_like + autogen: full_like.out - func: from_file(str filename, bool? shared=None, int? size=0, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor dispatch: CPU: from_file + autogen: from_file.out - func: gcd.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -2457,6 +2532,7 @@ dispatch: CPU, QuantizedCPU: grid_sampler_2d_cpu CUDA: grid_sampler_2d_cuda + autogen: grid_sampler_2d.out # `grid_sampler_2d_backward` takes in `output_mask` to optimize performance for # the case where `input` doesn't require gradient. Gradient for `grid` is always @@ -2465,11 +2541,13 @@ dispatch: CPU: grid_sampler_2d_backward_cpu CUDA: grid_sampler_2d_backward_cuda + autogen: grid_sampler_2d_backward.out # See NOTE [ grid_sample CPU fallback ] - func: _grid_sampler_2d_cpu_fallback(Tensor input, Tensor grid, int interpolation_mode, int padding_mode, bool align_corners) -> Tensor dispatch: CompositeExplicitAutograd: _grid_sampler_2d_cpu_fallback + autogen: _grid_sampler_2d_cpu_fallback.out - func: _grid_sampler_2d_cpu_fallback_backward(Tensor grad_output, Tensor input, Tensor grid, int interpolation_mode, int padding_mode, bool align_corners) -> (Tensor, Tensor) @@ -2477,6 +2555,7 @@ dispatch: CPU: grid_sampler_3d_cpu CUDA: grid_sampler_3d_cuda + autogen: grid_sampler_3d.out # `grid_sampler_3d_backward` takes in `output_mask` to optimize performance for # the case where `input` doesn't require gradient. Gradient for `grid` is always @@ -2485,42 +2564,52 @@ dispatch: CPU: grid_sampler_3d_backward_cpu CUDA: grid_sampler_3d_backward_cuda + autogen: grid_sampler_3d_backward.out - func: hann_window(int window_length, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: hann_window + autogen: hann_window.out - func: hann_window.periodic(int window_length, bool periodic, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: hann_window + autogen: hann_window.periodic_out - func: hamming_window(int window_length, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: hamming_window + autogen: hamming_window.out - func: hamming_window.periodic(int window_length, bool periodic, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: hamming_window + autogen: hamming_window.periodic_out - func: hamming_window.periodic_alpha(int window_length, bool periodic, float alpha, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: hamming_window + autogen: hamming_window.periodic_alpha_out - func: hamming_window.periodic_alpha_beta(int window_length, bool periodic, float alpha, float beta, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: hamming_window + autogen: hamming_window.periodic_alpha_beta_out - func: kaiser_window(int window_length, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: kaiser_window + autogen: kaiser_window.out - func: kaiser_window.periodic(int window_length, bool periodic, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: kaiser_window + autogen: kaiser_window.periodic_out - func: kaiser_window.beta(int window_length, bool periodic, float beta, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: kaiser_window + autogen: kaiser_window.beta_out - func: hinge_embedding_loss(Tensor self, Tensor target, float margin=1.0, int reduction=Mean) -> Tensor @@ -2530,10 +2619,12 @@ dispatch: CPU, CUDA: native_group_norm CompositeExplicitAutograd: math_group_norm + autogen: native_group_norm.out - func: native_group_norm_backward(Tensor grad_out, Tensor input, Tensor mean, Tensor rstd, Tensor? weight, int N, int C, int HxW, int group, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: CPU, CUDA: native_group_norm_backward + autogen: native_group_norm_backward.out # Real to complex forward FFT - func: _fft_r2c(Tensor self, int[] dim, int normalization, bool onesided) -> Tensor @@ -2608,7 +2699,7 @@ precomputed: - indices -> DimVector sizes, DimVector strides dispatch: - CPU, CUDA: index_out + CPU, CUDA, MPS: index_out - func: index_copy.out(Tensor self, int dim, Tensor index, Tensor source, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -2661,15 +2752,6 @@ - func: instance_norm(Tensor input, Tensor? weight, Tensor? bias, Tensor? running_mean, Tensor? running_var, bool use_input_stats, float momentum, float eps, bool cudnn_enabled) -> Tensor variants: function -- func: inverse(Tensor self) -> Tensor - variants: function, method - dispatch: - CompositeExplicitAutograd: inverse - -- func: inverse.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) - dispatch: - CompositeExplicitAutograd: inverse_out - - func: isclose(Tensor self, Tensor other, float rtol=1e-05, float atol=1e-08, bool equal_nan=False) -> Tensor variants: function, method @@ -2711,6 +2793,7 @@ CPU, CUDA, MPS: isnan SparseCPU, SparseCUDA: isnan_sparse SparseCsrCPU, SparseCsrCUDA: isnan_sparse_csr + autogen: isnan.out - func: is_distributed(Tensor self) -> bool variants: function, method @@ -2802,12 +2885,14 @@ CUDA: layer_norm_cuda MPS: layer_norm_mps CompositeExplicitAutograd: math_native_layer_norm + autogen: native_layer_norm.out - func: native_layer_norm_backward(Tensor grad_out, Tensor input, int[] normalized_shape, Tensor mean, Tensor rstd, Tensor? weight, Tensor? bias, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: CPU: layer_norm_backward_cpu CUDA: layer_norm_backward_cuda MPS: layer_norm_backward_mps + autogen: native_layer_norm_backward.out - func: nan_to_num(Tensor self, float? nan=None, float? posinf=None, float? neginf=None) -> Tensor variants: function, method @@ -2831,52 +2916,39 @@ dispatch: CompositeImplicitAutograd: linear NestedTensorCPU, NestedTensorCUDA: nested_linear + MPS: _mps_linear - func: linear_backward(Tensor self, Tensor grad_output, Tensor weight, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: NestedTensorCPU, NestedTensorCUDA: nested_linear_backward + MPS: mps_linear_backward + autogen: linear_backward.out - func: linear.out(Tensor input, Tensor weight, Tensor? bias=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn dispatch: CompositeExplicitAutograd: linear_out -# TODO: Add this function to MPS dispatch key so that we avoid declaring it in -# native_functions.yaml -# https://github.com/pytorch/pytorch/issues/77394 -- func: _mps_linear(Tensor self, Tensor weight, Tensor? 
bias=None) -> Tensor - python_module: nn - dispatch: - MPS: _mps_linear - - func: mkldnn_linear(Tensor self, Tensor weight, Tensor? bias=None) -> Tensor python_module: nn dispatch: MkldnnCPU: mkldnn_linear + autogen: mkldnn_linear.out - func: mkldnn_linear_backward_input(int[] input_size, Tensor grad_output, Tensor weight) -> Tensor dispatch: MkldnnCPU: mkldnn_linear_backward_input + autogen: mkldnn_linear_backward_input.out - func: mkldnn_linear_backward_weights(Tensor grad_output, Tensor input, Tensor weight, bool bias_defined) -> (Tensor, Tensor) dispatch: MkldnnCPU: mkldnn_linear_backward_weights + autogen: mkldnn_linear_backward_weights.out - func: mkldnn_linear_backward(Tensor self, Tensor grad_output, Tensor weight, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: MkldnnCPU: mkldnn_linear_backward - -- func: _mps_linear_backward_input(int[] input_size, Tensor grad_output, Tensor weight) -> Tensor - dispatch: - MPS: _mps_linear_backward_input - -- func: _mps_linear_backward_weights(Tensor grad_output, Tensor input, Tensor weight, bool bias_defined) -> (Tensor, Tensor) - dispatch: - MPS: _mps_linear_backward_weights - -- func: mps_linear_backward(Tensor self, Tensor grad_output, Tensor weight, bool[3] output_mask) -> (Tensor, Tensor, Tensor) - dispatch: - MPS: mps_linear_backward + autogen: mkldnn_linear_backward.out - func: fbgemm_linear_int8_weight_fp32_activation(Tensor input, Tensor weight, Tensor packed, Tensor col_offsets, Scalar weight_scale, Scalar weight_zero_point, Tensor bias) -> Tensor @@ -3152,8 +3224,19 @@ - func: matmul(Tensor self, Tensor other) -> Tensor variants: function, method + dispatch: + CompositeImplicitAutograd: matmul + NestedTensorCPU, NestedTensorCUDA: matmul_nested + +- func: matmul_backward(Tensor grad, Tensor self, Tensor other, bool[2] mask) -> (Tensor, Tensor) + dispatch: + NestedTensorCPU, NestedTensorCUDA: matmul_backward_nested + autogen: matmul_backward.out - func: matmul.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) + dispatch: + CompositeImplicitAutograd: matmul_out + NestedTensorCPU, NestedTensorCUDA: matmul_out_nested - func: matrix_rank.tol(Tensor self, float tol, bool symmetric=False) -> Tensor @@ -3177,11 +3260,13 @@ - func: _aminmax(Tensor self) -> (Tensor, Tensor) dispatch: CPU, CUDA: _aminmax_all + autogen: _aminmax.out # DEPRECATED: Use torch.aminmax instead - func: _aminmax.dim(Tensor self, int dim, bool keepdim=False) -> (Tensor, Tensor) dispatch: CPU, CUDA: _aminmax + autogen: _aminmax.dim_out - func: aminmax(Tensor self, *, int? 
dim=None, bool keepdim=False) -> (Tensor min, Tensor max) device_check: NoCheck # TensorIterator @@ -3253,35 +3338,43 @@ - func: _mps_max_pool2d(Tensor self, int[2] kernel_size, int[2] stride=[], int[2] padding=0, int[2] dilation=1, bool ceil_mode=False) -> Tensor dispatch: MPS: _mps_max_pool2d + autogen: _mps_max_pool2d.out - func: mps_max_pool2d_backward(Tensor grad_output, Tensor self, int[2] kernel_size, int[2] stride=[], int[2] padding=0, int[2] dilation=1, bool ceil_mode=False) -> Tensor dispatch: MPS: mps_max_pool2d_backward + autogen: mps_max_pool2d_backward.out - func: mkldnn_max_pool2d(Tensor self, int[2] kernel_size, int[2] stride=[], int[2] padding=0, int[2] dilation=1, bool ceil_mode=False) -> Tensor dispatch: MkldnnCPU: mkldnn_max_pool2d + autogen: mkldnn_max_pool2d.out - func: mkldnn_max_pool2d_backward(Tensor grad_output, Tensor output, Tensor input, int[2] kernel_size, int[2] stride=[], int[2] padding=0, int[2] dilation=1, bool ceil_mode=False) -> Tensor dispatch: MkldnnCPU: mkldnn_max_pool2d_backward + autogen: mkldnn_max_pool2d_backward.out - func: mkldnn_max_pool3d(Tensor self, int[3] kernel_size, int[3] stride=[], int[3] padding=0, int[3] dilation=1, bool ceil_mode=False) -> Tensor dispatch: MkldnnCPU: mkldnn_max_pool3d + autogen: mkldnn_max_pool3d.out - func: mkldnn_max_pool3d_backward(Tensor grad_output, Tensor output, Tensor input, int[3] kernel_size, int[3] stride=[], int[3] padding=0, int[3] dilation=1, bool ceil_mode=False) -> Tensor dispatch: MkldnnCPU: mkldnn_max_pool3d_backward + autogen: mkldnn_max_pool3d_backward.out - func: quantized_max_pool1d(Tensor self, int[1] kernel_size, int[1] stride=[], int[1] padding=0, int[1] dilation=1, bool ceil_mode=False) -> Tensor dispatch: QuantizedCPU: quantized_max_pool1d + autogen: quantized_max_pool1d.out - func: quantized_max_pool2d(Tensor self, int[2] kernel_size, int[2] stride=[], int[2] padding=0, int[2] dilation=1, bool ceil_mode=False) -> Tensor dispatch: QuantizedCPU: quantized_max_pool2d QuantizedCUDA: quantized_max_pool2d_cudnn + autogen: quantized_max_pool2d.out - func: max_pool3d(Tensor self, int[3] kernel_size, int[3] stride=[], int[3] padding=0, int[3] dilation=1, bool ceil_mode=False) -> Tensor @@ -3293,6 +3386,13 @@ dispatch: CompositeExplicitAutograd: mean +# For normal naming convention this should be `mean.out`. However since we already have `mean.out` we have to rename this. +# FIXME: fix CI jobs and re-enable this +#- func: mean.dtype_out(Tensor self, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) +# device_check: NoCheck # TensorIterator +# dispatch: +# CompositeExplicitAutograd: mean_dtype_out + - func: mean.dim(Tensor self, int[1]? dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor structured_delegate: mean.out device_check: NoCheck # TensorIterator @@ -3315,11 +3415,11 @@ - func: mean.names_out(Tensor self, Dimname[1] dim, bool keepdim=False, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator -- func: nanmean(Tensor self, int[1] dim=[], bool keepdim=False, *, ScalarType? dtype=None) -> Tensor +- func: nanmean(Tensor self, int[1]? dim=None, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor device_check: NoCheck # Composite variants: function, method -- func: nanmean.out(Tensor self, int[1] dim=[], bool keepdim=False, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) +- func: nanmean.out(Tensor self, int[1]? dim=None, bool keepdim=False, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) 
device_check: NoCheck # Composite - func: median(Tensor self) -> Tensor @@ -3327,6 +3427,7 @@ dispatch: CPU: median_cpu CUDA: median_cuda + autogen: median.out - func: median.dim(Tensor self, int dim, bool keepdim=False) -> (Tensor values, Tensor indices) variants: function, method @@ -3348,6 +3449,7 @@ dispatch: CPU: nanmedian_cpu CUDA: nanmedian_cuda + autogen: nanmedian.out - func: nanmedian.dim(Tensor self, int dim, bool keepdim=False) -> (Tensor values, Tensor indices) variants: function, method @@ -3403,42 +3505,60 @@ - func: _mps_convolution(Tensor self, Tensor weight, Tensor? bias, int[] padding, int[] stride, int[] dilation, int groups) -> Tensor dispatch: MPS: _mps_convolution + autogen: _mps_convolution.out - func: mps_convolution_backward(Tensor self, Tensor grad_output, Tensor weight, int[] padding, int[] stride, int[] dilation, int groups, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: MPS: mps_convolution_backward + autogen: mps_convolution_backward.out - func: mkldnn_convolution(Tensor self, Tensor weight, Tensor? bias, int[] padding, int[] stride, int[] dilation, int groups) -> Tensor dispatch: CompositeExplicitAutograd: mkldnn_convolution + autogen: mkldnn_convolution.out - func: miopen_batch_norm(Tensor input, Tensor weight, Tensor? bias, Tensor? running_mean, Tensor? running_var, bool training, float exponential_average_factor, float epsilon) -> (Tensor, Tensor, Tensor) dispatch: CUDA: miopen_batch_norm + autogen: miopen_batch_norm.out - func: miopen_batch_norm_backward(Tensor input, Tensor grad_output, Tensor weight, Tensor? running_mean, Tensor? running_var, Tensor? save_mean, Tensor? save_var, float epsilon) -> (Tensor, Tensor, Tensor) dispatch: CUDA: miopen_batch_norm_backward + autogen: miopen_batch_norm_backward.out - func: miopen_convolution(Tensor self, Tensor weight, Tensor? bias, int[] padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic) -> Tensor dispatch: CUDA: miopen_convolution + autogen: miopen_convolution.out - func: miopen_convolution_transpose(Tensor self, Tensor weight, Tensor? bias, int[] padding, int[] output_padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic) -> Tensor dispatch: CUDA: miopen_convolution_transpose + autogen: miopen_convolution_transpose.out - func: miopen_depthwise_convolution(Tensor self, Tensor weight, Tensor? bias, int[] padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic) -> Tensor dispatch: CUDA: miopen_depthwise_convolution + autogen: miopen_depthwise_convolution.out + +- func: miopen_convolution_relu(Tensor self, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, int groups) -> Tensor + dispatch: + CUDA: miopen_convolution_relu + +- func: miopen_convolution_add_relu(Tensor self, Tensor weight, Tensor z, Scalar? alpha, Tensor? bias, int[] stride, int[] padding, int[] dilation, int groups) -> Tensor + dispatch: + CUDA: miopen_convolution_add_relu - func: miopen_rnn(Tensor input, Tensor[] weight, int weight_stride0, Tensor hx, Tensor? cx, int mode, int hidden_size, int num_layers, bool batch_first, float dropout, bool train, bool bidirectional, int[] batch_sizes, Tensor? dropout_state) -> (Tensor, Tensor, Tensor, Tensor, Tensor) dispatch: CUDA: miopen_rnn + autogen: miopen_rnn.out - func: miopen_rnn_backward(Tensor input, Tensor[] weight, int weight_stride0, Tensor weight_buf, Tensor hx, Tensor? cx, Tensor output, Tensor? grad_output, Tensor? grad_hy, Tensor? 
grad_cy, int mode, int hidden_size, int num_layers, bool batch_first, float dropout, bool train, bool bidirectional, int[] batch_sizes, Tensor? dropout_state, Tensor reserve, bool[4] output_mask) -> (Tensor, Tensor, Tensor, Tensor[]) dispatch: CUDA: miopen_rnn_backward + autogen: miopen_rnn_backward.out - func: mm(Tensor self, Tensor mat2) -> Tensor structured_delegate: mm.out @@ -3463,11 +3583,13 @@ dispatch: SparseCPU: sparse_sparse_matmul_cpu SparseCUDA: sparse_sparse_matmul_cuda + autogen: _sparse_sparse_matmul.out - func: _sparse_mask_helper(Tensor t, Tensor mask_indices) -> Tensor dispatch: SparseCPU: sparse_mask_helper_cpu SparseCUDA: sparse_mask_helper_cuda + autogen: _sparse_mask_helper.out - func: mode(Tensor self, int dim=-1, bool keepdim=False) -> (Tensor values, Tensor indices) variants: function, method @@ -3587,6 +3709,7 @@ variants: function, method dispatch: CompositeExplicitAutograd: narrow_copy_symint + autogen: narrow_copy.SymInt_out - func: narrow_copy.out(Tensor self, int dim, int start, int length, *, Tensor(a!) out) -> Tensor(a!) dispatch: @@ -3617,6 +3740,7 @@ - func: batch_norm_stats(Tensor input, float eps) -> (Tensor, Tensor) dispatch: CUDA: batch_norm_stats_cuda + autogen: batch_norm_stats.out - func: batch_norm_elemt(Tensor input, Tensor? weight, Tensor? bias, Tensor mean, Tensor invstd, float eps) -> Tensor dispatch: @@ -3630,10 +3754,12 @@ - func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, int count) -> (Tensor, Tensor) dispatch: CUDA: batch_norm_gather_stats_cuda + autogen: batch_norm_gather_stats.out - func: batch_norm_gather_stats_with_counts(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor) dispatch: CUDA: batch_norm_gather_stats_with_counts_cuda + autogen: batch_norm_gather_stats_with_counts.out - func: native_batch_norm_backward(Tensor grad_out, Tensor input, Tensor? weight, Tensor? running_mean, Tensor? running_var, Tensor? save_mean, Tensor? save_invstd, bool train, float eps, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: @@ -3641,19 +3767,23 @@ CUDA: batch_norm_backward_cuda MPS: batch_norm_backward_mps MkldnnCPU: mkldnn_batch_norm_backward + autogen: native_batch_norm_backward.out - func: batch_norm_backward_reduce(Tensor grad_out, Tensor input, Tensor mean, Tensor invstd, Tensor? weight, bool input_g, bool weight_g, bool bias_g) -> (Tensor, Tensor, Tensor, Tensor) dispatch: CUDA: batch_norm_backward_reduce_cuda + autogen: batch_norm_backward_reduce.out - func: batch_norm_backward_elemt(Tensor grad_out, Tensor input, Tensor mean, Tensor invstd, Tensor? weight, Tensor mean_dy, Tensor mean_dy_xmu, Tensor count) -> Tensor dispatch: CUDA: batch_norm_backward_elemt_cuda + autogen: batch_norm_backward_elemt.out - func: batch_norm_update_stats(Tensor input, Tensor? running_mean, Tensor? running_var, float momentum) -> (Tensor, Tensor) dispatch: CPU: batch_norm_update_stats_cpu CUDA: batch_norm_update_stats_cuda + autogen: batch_norm_update_stats.out - func: is_vulkan_available() -> bool @@ -3663,12 +3793,14 @@ variants: function dispatch: CompositeExplicitAutograd: _nnpack_spatial_convolution + autogen: _nnpack_spatial_convolution.out - func: ones.names(int[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: ones + autogen: ones.names_out - func: ones(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: @@ -3683,6 +3815,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: ones_like + autogen: ones_like.out - func: pairwise_distance(Tensor x1, Tensor x2, float p=2, float eps=1e-06, bool keepdim=False) -> Tensor @@ -3691,24 +3824,29 @@ - func: _euclidean_dist(Tensor x1, Tensor x2) -> Tensor dispatch: CompositeExplicitAutograd: _euclidean_dist + autogen: _euclidean_dist.out - func: _cdist_forward(Tensor x1, Tensor x2, float p, int? compute_mode) -> Tensor dispatch: CPU, CUDA: _cdist_forward + autogen: _cdist_forward.out - func: _cdist_backward(Tensor grad, Tensor x1, Tensor x2, float p, Tensor cdist) -> Tensor dispatch: CPU, CUDA: _cdist_backward + autogen: _cdist_backward.out - func: pdist(Tensor self, float p=2) -> Tensor - func: _pdist_forward(Tensor self, float p=2) -> Tensor dispatch: CPU, CUDA: _pdist_forward + autogen: _pdist_forward.out - func: _pdist_backward(Tensor grad, Tensor self, float p, Tensor pdist) -> Tensor dispatch: CPU, CUDA: _pdist_backward + autogen: _pdist_backward.out - func: cosine_similarity(Tensor x1, Tensor x2, int dim=1, float eps=1e-08) -> Tensor variants: function @@ -3762,16 +3900,19 @@ dispatch: CPU: pixel_shuffle_cpu CompositeExplicitAutogradNonFunctional: math_pixel_shuffle + autogen: pixel_shuffle.out - func: pixel_unshuffle(Tensor self, int downscale_factor) -> Tensor dispatch: CPU: pixel_unshuffle_cpu CompositeExplicitAutogradNonFunctional: math_pixel_unshuffle + autogen: pixel_unshuffle.out - func: channel_shuffle(Tensor self, int groups) -> Tensor dispatch: CPU: channel_shuffle QuantizedCPU: channel_shuffle_quantized_cpu + autogen: channel_shuffle.out - func: native_channel_shuffle(Tensor self, int groups) -> Tensor dispatch: @@ -3795,6 +3936,7 @@ dispatch: CUDA: _pin_memory_cuda MPS: _pin_memory_mps + autogen: _pin_memory.out - func: pinverse(Tensor self, float rcond=1e-15) -> Tensor variants: function, method @@ -3836,18 +3978,23 @@ - func: scalar_tensor(Scalar s, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: scalar_tensor + autogen: scalar_tensor.out - func: rand.names(int[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: rand + autogen: rand.names_out + tags: nondeterministic_seeded - func: rand.generator_with_names(int[] size, *, Generator? generator, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor device_check: NoCheck device_guard: False + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: rand + autogen: rand.generator_with_names_out - func: rand(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor tags: nondeterministic_seeded @@ -3855,14 +4002,17 @@ CompositeExplicitAutograd: rand - func: rand.generator(int[] size, *, Generator? generator, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: rand - func: rand.out(int[] size, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: rand_out - func: rand.generator_out(int[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded - func: rand_like(Tensor self, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor tags: nondeterministic_seeded @@ -3870,6 +4020,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: rand_like + autogen: rand_like.out - func: randint(int high, int[] size, *, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor tags: nondeterministic_seeded @@ -3877,6 +4028,7 @@ CompositeExplicitAutograd: randint - func: randint.generator(int high, int[] size, *, Generator? generator, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint @@ -3886,22 +4038,27 @@ CompositeExplicitAutograd: randint - func: randint.low_generator(int low, int high, int[] size, *, Generator? generator, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint - func: randint.out(int high, int[] size, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint_out - func: randint.generator_out(int high, int[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint_out - func: randint.low_out(int low, int high, int[] size, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint_out - func: randint.low_generator_out(int low, int high, int[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint_out @@ -3911,6 +4068,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: randint_like + autogen: randint_like.out - func: randint_like.low_dtype(Tensor self, int low, int high, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor tags: nondeterministic_seeded @@ -3918,6 +4076,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: randint_like + autogen: randint_like.low_dtype_out - func: randn(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor tags: nondeterministic_seeded @@ -3925,24 +4084,31 @@ CompositeExplicitAutograd: randn - func: randn.generator(int[] size, *, Generator? generator, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randn - func: randn.names(int[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor + tags: nondeterministic_seeded device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: randn + autogen: randn.names_out - func: randn.generator_with_names(int[] size, *, Generator? generator, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: randn + autogen: randn.generator_with_names_out - func: randn.out(int[] size, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded - func: randn.generator_out(int[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded - func: randn_like(Tensor self, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor tags: nondeterministic_seeded @@ -3950,6 +4116,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: randn_like + autogen: randn_like.out - func: randperm(int n, *, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor tags: nondeterministic_seeded @@ -3957,14 +4124,17 @@ CompositeExplicitAutograd: randperm - func: randperm.generator(int n, *, Generator? generator, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randperm - func: randperm.out(int n, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randperm_out - func: randperm.generator_out(int n, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CPU: randperm_out_cpu CUDA: randperm_out_cuda @@ -3977,10 +4147,15 @@ dispatch: CompositeExplicitAutograd: range +- func: range.out_(Scalar start, Scalar end, *, Tensor(a!) out) -> Tensor(a!) + dispatch: + CompositeExplicitAutograd: range_out_no_step + - func: range.out(Scalar start, Scalar end, Scalar step=1, *, Tensor(a!) out) -> Tensor(a!) dispatch: CPU, Meta: range_out CUDA: range_cuda_out + cpp_no_default_args: ['step'] - func: ravel(Tensor(a) self) -> Tensor(a) variants: function, method @@ -4043,6 +4218,7 @@ dispatch: CompositeExplicitAutograd: repeat MPS: repeat_mps + autogen: repeat.out - func: repeat_interleave.Tensor(Tensor repeats, *, int? output_size=None) -> Tensor variants: function @@ -4050,6 +4226,7 @@ CPU: repeat_interleave_cpu CUDA: repeat_interleave_cuda tags: dynamic_output_shape + autogen: repeat_interleave.Tensor_out - func: repeat_interleave.self_Tensor(Tensor self, Tensor repeats, int? dim=None, *, int? output_size=None) -> Tensor variants: function, method @@ -4065,10 +4242,12 @@ - func: _reshape_nested(Tensor self, int[] shape) -> Tensor dispatch: NestedTensorCPU, NestedTensorCUDA: _reshape_nested + autogen: _reshape_nested.out - func: _reshape_nested_backward(Tensor self, Tensor grad) -> Tensor dispatch: NestedTensorCPU, NestedTensorCUDA: _reshape_nested_backward + autogen: _reshape_nested_backward.out # NOTE [ _reshape_alias ] is meant to be used in the implementation of reshape. # They are not user-facing, hence the leading underscore. 
Please don't use it @@ -4086,6 +4265,7 @@ device_guard: False dispatch: MkldnnCPU: mkldnn_reshape + autogen: _mkldnn_reshape.out - func: reshape_as(Tensor(a) self, Tensor other) -> Tensor(a) variants: method @@ -4142,6 +4322,7 @@ tags: nondeterministic_seeded - func: rrelu_(Tensor(a!) self, Scalar lower=0.125, Scalar upper=0.3333333333333333, bool training=False, Generator? generator=None) -> Tensor(a!) + tags: nondeterministic_seeded device_check: NoCheck # TensorIterator - func: relu(Tensor self) -> Tensor @@ -4179,6 +4360,7 @@ CUDA: prelu_cuda MPS: prelu_mps QuantizedCPU: prelu_quantized_cpu + autogen: prelu.out - func: prelu_backward(Tensor grad_output, Tensor self, Tensor weight) -> (Tensor, Tensor) variants: function, method @@ -4187,6 +4369,7 @@ CPU: prelu_backward_cpu CUDA: prelu_backward_cuda MPS: prelu_backward_mps + autogen: prelu_backward.out - func: gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!) structured: True @@ -4296,6 +4479,7 @@ device_guard: False dispatch: CompositeExplicitAutogradNonFunctional: select_backward + autogen: select_backward.out - func: selu(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -4483,6 +4667,7 @@ variants: function, method dispatch: CompositeExplicitAutograd: detach + NestedTensorCPU, NestedTensorCUDA: detach # Like `detach()`, but modifies this `Variable` in-place. This method may # only be called on non-view `Variable`s. You can use `is_view()` to check @@ -4510,6 +4695,8 @@ device_guard: False dispatch: CompositeExplicitAutograd: slice +# NOTE: The implementation of split_with_sizes bypasses the dispatcher to call this; undo +# that if adding specific implementations here! - func: slice_backward(Tensor grad_output, int[] input_sizes, int dim, int start, int end, int step) -> Tensor variants: function @@ -4517,6 +4704,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: slice_backward + autogen: slice_backward.out - func: slice_scatter(Tensor self, Tensor src, int dim=0, int? start=None, int? end=None, int step=1) -> Tensor variants: function, method @@ -4524,6 +4712,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: slice_scatter + autogen: slice_scatter.out - func: select_scatter(Tensor self, Tensor src, int dim, int index) -> Tensor variants: function, method @@ -4531,6 +4720,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: select_scatter + autogen: select_scatter.out - func: diagonal_scatter(Tensor self, Tensor src, int offset=0, int dim1=0, int dim2=1) -> Tensor variants: function, method @@ -4538,6 +4728,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: diagonal_scatter + autogen: diagonal_scatter.out - func: as_strided_scatter(Tensor self, Tensor src, int[] size, int[] stride, int? storage_offset=None) -> Tensor variants: function, method @@ -4545,6 +4736,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: as_strided_scatter + autogen: as_strided_scatter.out - func: smm(Tensor self, Tensor mat2) -> Tensor variants: function, method @@ -4552,9 +4744,6 @@ # softmax allows positional dtype, unlike most operators, because kwonly is BC-breaking when loading jit models. - func: softmax.int(Tensor self, int dim, ScalarType? dtype=None) -> Tensor variants: function, method - dispatch: - CompositeImplicitAutograd: softmax - NestedTensorCPU, NestedTensorCUDA: softmax - func: softmax.int_out(Tensor self, int dim, ScalarType? dtype=None, *, Tensor(a!) out) -> Tensor(a!) 
variants: function @@ -4579,6 +4768,8 @@ - func: _softmax_backward_data(Tensor grad_output, Tensor output, int dim, ScalarType input_dtype) -> Tensor structured_delegate: _softmax_backward_data.out + dispatch: + NestedTensorCPU, NestedTensorCUDA: nested_softmax_backward - func: _softmax_backward_data.out(Tensor grad_output, Tensor output, int dim, ScalarType input_dtype, *, Tensor(a!) grad_input) -> Tensor(a!) structured: True @@ -4593,6 +4784,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: unsafe_split + autogen: unsafe_split.Tensor_out - func: split.Tensor(Tensor(a -> *) self, int split_size, int dim=0) -> Tensor(a)[] variants: function, method @@ -4611,6 +4803,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: unsafe_split_with_sizes + autogen: unsafe_split_with_sizes.out - func: split_with_sizes(Tensor(a -> *) self, int[] split_sizes, int dim=0) -> Tensor(a)[] variants: function, method @@ -4748,12 +4941,7 @@ dispatch: CompositeExplicitAutograd: sum SparseCsrCPU, SparseCsrCUDA: sum_csr - -- func: sum.SymInt(Tensor self, SymInt[1] dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor - device_check: NoCheck # TensorIterator - variants: function, method - dispatch: - CompositeExplicitAutograd: sum_symint + autogen: sum.out - func: sum.dim_IntList(Tensor self, int[1]? dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor structured_delegate: sum.IntList_out @@ -4776,12 +4964,17 @@ - func: sum.DimnameList_out(Tensor self, Dimname[1] dim, bool keepdim=False, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator -- func: nansum(Tensor self, int[1] dim=[], bool keepdim=False, *, ScalarType? dtype=None) -> Tensor +# TODO: this function will be replaced once nested expand semantics have been settled on +- func: _nested_sum_backward(Tensor grad, Tensor self, int[1]? dim, bool keepdim=False) -> Tensor + dispatch: + NestedTensorCPU: _nested_sum_backward_cpu + +- func: nansum(Tensor self, int[1]? dim=None, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor variants: function, method dispatch: CPU, CUDA: nansum -- func: nansum.out(Tensor self, int[1] dim=[], bool keepdim=False, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) +- func: nansum.out(Tensor self, int[1]? dim=None, bool keepdim=False, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) dispatch: CPU, CUDA: nansum_out @@ -4830,7 +5023,7 @@ device_check: NoCheck # TensorIterator variants: function, method -- func: std.dim(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False) -> Tensor +- func: std.dim(Tensor self, int[1]? dim, bool unbiased=True, bool keepdim=False) -> Tensor device_check: NoCheck # TensorIterator variants: function, method @@ -4846,7 +5039,7 @@ device_check: NoCheck # TensorIterator variants: function -- func: std_mean.dim(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False) -> (Tensor, Tensor) +- func: std_mean.dim(Tensor self, int[1]? dim, bool unbiased=True, bool keepdim=False) -> (Tensor, Tensor) device_check: NoCheck # TensorIterator variants: function @@ -4855,6 +5048,7 @@ variants: function dispatch: CPU, CUDA: std_mean + autogen: std_mean.correction_out - func: std_mean.names_dim(Tensor self, Dimname[1] dim, bool unbiased=True, bool keepdim=False) -> (Tensor, Tensor) device_check: NoCheck # TensorIterator @@ -4864,7 +5058,7 @@ device_check: NoCheck # TensorIterator variants: function -- func: std.out(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False, *, Tensor(a!) 
out) -> Tensor(a!) +- func: std.out(Tensor self, int[1]? dim, bool unbiased=True, bool keepdim=False, *, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator - func: std.correction_out(Tensor self, int[1]? dim, *, int? correction, bool keepdim=False, Tensor(a!) out) -> Tensor(a!) @@ -4894,6 +5088,7 @@ dispatch: CPU, CUDA: prod MPS: prod_mps + autogen: prod.out - func: prod.dim_int(Tensor self, int dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor structured_delegate: prod.int_out @@ -5073,6 +5268,7 @@ dispatch: CPU, QuantizedCPU, CUDA, QuantizedCUDA: flip MPS: flip_mps + autogen: flip.out - func: fliplr(Tensor self) -> Tensor variants: function, method @@ -5085,6 +5281,7 @@ dispatch: CPU: roll_cpu CUDA: roll_cuda + autogen: roll.out # default int[] value [0,1] should not add space after comma, since codegen parser uses ', ' to split args @@ -5092,6 +5289,7 @@ variants: function, method dispatch: CompositeExplicitAutograd: rot90 + autogen: rot90.out - func: trapezoid.x(Tensor y, Tensor x, *, int dim=-1) -> Tensor @@ -5106,10 +5304,12 @@ dispatch: CPU, NestedTensorCPU: transform_bias_rescale_qkv_cpu CUDA, NestedTensorCUDA: transform_bias_rescale_qkv_cuda + autogen: _transform_bias_rescale_qkv.out -- func: _nested_tensor_from_mask(Tensor t, Tensor mask) -> Tensor +- func: _nested_tensor_from_mask(Tensor t, Tensor mask, bool mask_check=True) -> Tensor dispatch: CPU, CUDA: NestedTensor_nested_tensor_from_mask + autogen: _nested_tensor_from_mask.out - func: _nested_tensor_from_mask_left_aligned(Tensor t, Tensor mask) -> bool dispatch: @@ -5120,22 +5320,26 @@ dispatch: CPU: nested_from_padded_generic CUDA: nested_from_padded_cuda + autogen: _nested_from_padded.out - func: _nested_tensor_size(Tensor self) -> Tensor variants: method dispatch: NestedTensorCPU, NestedTensorCUDA: NestedTensor_get_nested_size_tensor + autogen: _nested_tensor_size.out # _nested_from_padded is not usable from Python, so # _nested_from_padded_and_nested_example is available for testing. - func: _nested_from_padded_and_nested_example(Tensor padded, Tensor nt_example) -> Tensor dispatch: NestedTensorCPU, NestedTensorCUDA: NestedTensor_from_padded_and_nested_example + autogen: _nested_from_padded_and_nested_example.out - func: _trilinear(Tensor i1, Tensor i2, Tensor i3, int[] expand1, int[] expand2, int[] expand3, int[] sumdim, int unroll_dim=1) -> Tensor dispatch: - # calls unsqueeze + # calls unsqueeze CompositeExplicitAutogradNonFunctional: _trilinear + autogen: _trilinear.out - func: triplet_margin_loss(Tensor anchor, Tensor positive, Tensor negative, float margin=1.0, float p=2, float eps=1e-06, bool swap=False, int reduction=Mean) -> Tensor @@ -5185,6 +5389,7 @@ dispatch: CPU: _unique_cpu CUDA: _unique_cuda + autogen: _unique.out - func: unique_dim(Tensor self, int dim, bool sorted=True, bool return_inverse=False, bool return_counts=False) -> (Tensor, Tensor, Tensor) variants: function @@ -5192,6 +5397,7 @@ CPU: unique_dim_cpu CUDA: unique_dim_cuda tags: dynamic_output_shape + autogen: unique_dim.out - func: unique_consecutive(Tensor self, bool return_inverse=False, bool return_counts=False, int? 
dim=None) -> (Tensor, Tensor, Tensor) variants: function @@ -5199,6 +5405,7 @@ CPU: unique_consecutive_cpu CUDA: unique_consecutive_cuda tags: dynamic_output_shape + autogen: unique_consecutive.out - func: unique_dim_consecutive(Tensor self, int dim, bool return_inverse=False, bool return_counts=False) -> (Tensor, Tensor, Tensor) variants: function @@ -5206,6 +5413,7 @@ CPU: unique_dim_consecutive_cpu CUDA: unique_dim_consecutive_cuda tags: dynamic_output_shape + autogen: unique_dim_consecutive.out # _unique and _unique_dim are fragile and modifying them easily cause internal break # the below operator is a temporary hack for adding return_counts support @@ -5217,10 +5425,12 @@ CPU: _unique2_cpu CUDA: _unique2_cuda tags: dynamic_output_shape + autogen: _unique2.out - func: _unsafe_view(Tensor self, int[] size) -> Tensor dispatch: CompositeExplicitAutograd: _unsafe_view + autogen: _unsafe_view.out - func: unsqueeze(Tensor(a) self, int dim) -> Tensor(a) variants: function, method @@ -5245,7 +5455,7 @@ device_check: NoCheck # TensorIterator variants: function, method -- func: var.dim(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False) -> Tensor +- func: var.dim(Tensor self, int[1]? dim, bool unbiased=True, bool keepdim=False) -> Tensor device_check: NoCheck # TensorIterator variants: function, method @@ -5256,7 +5466,7 @@ CPU, CUDA: var MPS: var_mps -- func: var.out(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False, *, Tensor(a!) out) -> Tensor(a!) +- func: var.out(Tensor self, int[1]? dim, bool unbiased=True, bool keepdim=False, *, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator - func: var.correction_out(Tensor self, int[1]? dim, *, int? correction, bool keepdim=False, Tensor(a!) out) -> Tensor(a!) @@ -5283,7 +5493,7 @@ device_check: NoCheck # TensorIterator variants: function -- func: var_mean.dim(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False) -> (Tensor, Tensor) +- func: var_mean.dim(Tensor self, int[1]? dim, bool unbiased=True, bool keepdim=False) -> (Tensor, Tensor) device_check: NoCheck # TensorIterator variants: function @@ -5292,6 +5502,7 @@ variants: function dispatch: CPU, CUDA: var_mean + autogen: var_mean.correction_out - func: var_mean.names_dim(Tensor self, Dimname[1] dim, bool unbiased=True, bool keepdim=False) -> (Tensor, Tensor) device_check: NoCheck # TensorIterator @@ -5345,12 +5556,14 @@ dispatch: CPU: weight_norm_cpu CUDA: weight_norm_cuda + autogen: _weight_norm_interface.out - func: _weight_norm_interface_backward(Tensor grad_w, Tensor saved_v, Tensor saved_g, Tensor saved_norms, int dim) -> (Tensor, Tensor) variants: function dispatch: CPU: weight_norm_backward_cpu CUDA: weight_norm_backward_cuda + autogen: _weight_norm_interface_backward.out - func: _weight_norm_differentiable_backward(Tensor grad_w, Tensor saved_v, Tensor saved_g, Tensor saved_norms, int dim) -> (Tensor, Tensor) variants: function @@ -5360,11 +5573,13 @@ device_guard: False dispatch: CompositeExplicitAutograd: zeros + autogen: zeros.names_out - func: _efficientzerotensor(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CPU: _efficientzerotensor CUDA: _efficientzerotensor_cuda + autogen: _efficientzerotensor.out - func: zeros(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: @@ -5373,6 +5588,7 @@ - func: zeros.SymInt(SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? 
device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: zeros_symint + autogen: zeros.SymInt_out - func: zeros.out(int[] size, *, Tensor(a!) out) -> Tensor(a!) dispatch: @@ -5384,12 +5600,14 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: zeros_like + autogen: zeros_like.out - func: _standard_gamma_grad(Tensor self, Tensor output) -> Tensor variants: function dispatch: CPU: _standard_gamma_grad_cpu CUDA: _standard_gamma_grad_cuda + autogen: _standard_gamma_grad.out - func: _standard_gamma(Tensor self, Generator? generator=None) -> Tensor variants: function @@ -5397,17 +5615,21 @@ CPU: _s_gamma_cpu CUDA: _s_gamma_cuda tags: nondeterministic_seeded + autogen: _standard_gamma.out - func: _dirichlet_grad(Tensor x, Tensor alpha, Tensor total) -> Tensor dispatch: CPU: _dirichlet_grad_cpu CUDA: _dirichlet_grad_cuda + autogen: _dirichlet_grad.out - func: _sample_dirichlet(Tensor self, Generator? generator=None) -> Tensor + tags: nondeterministic_seeded variants: function dispatch: CPU: _s_dirichlet_cpu CUDA: _s_dirichlet_cuda + autogen: _sample_dirichlet.out - func: poisson(Tensor self, Generator? generator=None) -> Tensor device_check: NoCheck # TensorIterator @@ -5415,6 +5637,7 @@ CPU: _s_poisson_cpu CUDA: _s_poisson_cuda tags: nondeterministic_seeded + autogen: poisson.out - func: binomial(Tensor count, Tensor prob, Generator? generator=None) -> Tensor device_check: NoCheck # TensorIterator @@ -5422,6 +5645,7 @@ CPU: _s_binomial_cpu CUDA: _s_binomial_cuda tags: nondeterministic_seeded + autogen: binomial.out # When more variants get ported to native, this dispatch will get more # complicated @@ -5429,10 +5653,12 @@ - func: native_norm(Tensor self, Scalar p=2) -> Tensor dispatch: SparseCPU, SparseCUDA: norm_sparse + autogen: native_norm.out - func: native_norm.ScalarOpt_dim_dtype(Tensor self, Scalar? p, int[1] dim, bool keepdim, ScalarType? dtype) -> Tensor dispatch: SparseCPU, SparseCUDA: norm_sparse + autogen: native_norm.ScalarOpt_dim_dtype_out # TODO: reduce signatures down to one when optional args is available - func: _sparse_sum(Tensor self) -> Tensor @@ -5442,6 +5668,7 @@ - func: _sparse_sum.dim(Tensor self, int[1] dim) -> Tensor dispatch: CompositeExplicitAutograd: _sparse_sum + autogen: _sparse_sum.dim_out - func: _sparse_sum.dim_dtype(Tensor self, int[1] dim, *, ScalarType dtype) -> Tensor @@ -5449,16 +5676,19 @@ dispatch: SparseCPU: _sparse_sum_backward_cpu SparseCUDA: _sparse_sum_backward_cuda + autogen: _sparse_sum_backward.out - func: _sparse_csr_sum.dim_dtype(Tensor self, int[1] dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor dispatch: SparseCsrCPU: _sparse_csr_sum_cpu SparseCsrCUDA: _sparse_csr_sum_cuda + autogen: _sparse_csr_sum.dim_dtype_out - func: _sparse_csr_prod.dim_dtype(Tensor self, int[1] dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor dispatch: SparseCsrCPU: _sparse_csr_prod_cpu SparseCsrCUDA: _sparse_csr_prod_cuda + autogen: _sparse_csr_prod.dim_dtype_out - func: _sparse_softmax.int(Tensor self, int dim, ScalarType? 
dtype=None) -> Tensor python_module: sparse @@ -5473,11 +5703,13 @@ dispatch: SparseCPU: softmax_sparse_cpu SparseCUDA: softmax_sparse_cuda + autogen: _sparse_softmax.out - func: _sparse_softmax_backward_data(Tensor grad_output, Tensor output, int dim, Tensor self) -> Tensor dispatch: SparseCPU: softmax_backward_sparse_cpu SparseCUDA: softmax_backward_sparse_cuda + autogen: _sparse_softmax_backward_data.out - func: _sparse_log_softmax.int(Tensor self, int dim, ScalarType? dtype=None) -> Tensor python_module: sparse @@ -5492,28 +5724,33 @@ dispatch: SparseCPU: log_softmax_sparse_cpu SparseCUDA: log_softmax_sparse_cuda + autogen: _sparse_log_softmax.out - func: _sparse_log_softmax_backward_data(Tensor grad_output, Tensor output, int dim, Tensor self) -> Tensor dispatch: SparseCPU: log_softmax_backward_sparse_cpu SparseCUDA: log_softmax_backward_sparse_cuda + autogen: _sparse_log_softmax_backward_data.out - func: _spdiags(Tensor diagonals, Tensor offsets, int[] shape, Layout? layout=None) -> Tensor python_module: sparse dispatch: CPU: spdiags + autogen: _spdiags.out - func: norm.ScalarOpt_dtype(Tensor self, Scalar? p, *, ScalarType dtype) -> Tensor device_check: NoCheck # TensorIterator variants: function, method dispatch: CompositeExplicitAutograd: norm + autogen: norm.ScalarOpt_dtype_out - func: norm.Scalar(Tensor self, Scalar p=2) -> Tensor device_check: NoCheck # TensorIterator variants: function, method dispatch: CompositeExplicitAutograd: norm + autogen: norm.Scalar_out - func: norm.ScalarOpt_dim_dtype(Tensor self, Scalar? p, int[1] dim, bool keepdim, *, ScalarType dtype) -> Tensor structured_delegate: norm.dtype_out @@ -5603,6 +5840,7 @@ MkldnnCPU: mkldnn_clone QuantizedCPU, QuantizedCUDA: quantized_clone NestedTensorCPU, NestedTensorCUDA: clone_nested + autogen: clone.out - func: positive(Tensor(a) self) -> Tensor(a) variants: function, method @@ -5693,6 +5931,7 @@ variants: function dispatch: CPU, CUDA: rsub + autogen: rsub.Tensor_out - func: heaviside.out(Tensor self, Tensor values, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -5717,6 +5956,7 @@ variants: function dispatch: CompositeExplicitAutograd: rsub + autogen: rsub.Scalar_out # Functionally the same as addmm, but we give it a different derivative formula # that doesn't propagate gradients to non-present entries on sparse. @@ -5724,6 +5964,7 @@ python_module: sparse dispatch: CompositeExplicitAutograd: _sparse_addmm + autogen: _sparse_addmm.out - func: sparse_sampled_addmm.out(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!) python_module: sparse @@ -5907,6 +6148,7 @@ - func: sparse_coo_tensor.size(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=False) -> Tensor dispatch: CompositeExplicitAutograd: sparse_coo_tensor + autogen: sparse_coo_tensor.size_out - func: sparse_coo_tensor.indices(Tensor indices, Tensor values, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor @@ -5925,10 +6167,12 @@ - func: _sparse_coo_tensor_with_dims(int sparse_dim, int dense_dim, int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=False) -> Tensor dispatch: SparseCPU, SparseCUDA, SparseMeta, Meta: new_with_dims_sparse + autogen: _sparse_coo_tensor_with_dims.out - func: _sparse_coo_tensor_with_dims_and_tensors(int sparse_dim, int dense_dim, int[] size, Tensor indices, Tensor values, *, ScalarType? dtype=None, Layout? layout=None, Device? 
device=None, bool? pin_memory=False) -> Tensor dispatch: SparseCPU, SparseCUDA, SparseMeta, Meta: new_with_dims_and_tensor_sparse + autogen: _sparse_coo_tensor_with_dims_and_tensors.out - func: sparse_resize_(Tensor(a!) self, int[] size, int sparse_dim, int dense_dim) -> Tensor(a!) use_const_ref_for_mutable_tensors: True @@ -5950,6 +6194,7 @@ SparseCPU: sparse_mask_cpu SparseCUDA: sparse_mask_cuda SparseCsrCPU, SparseCsrCUDA: sparse_mask_sparse_csr + autogen: sparse_mask.out - func: _to_cpu(Tensor[] tensors) -> Tensor[] variants: function @@ -5964,6 +6209,7 @@ SparseCPU, SparseCUDA: sparse_to_dense SparseCsrCPU, SparseCsrCUDA: sparse_compressed_to_dense MkldnnCPU: mkldnn_to_dense + autogen: _to_dense.out - func: to_dense_backward(Tensor grad, Tensor input) -> Tensor @@ -6019,6 +6265,7 @@ dispatch: SparseCPU: _coalesce_sparse_cpu SparseCUDA: _coalesce_sparse_cuda + autogen: _coalesce.out - func: is_coalesced(Tensor self) -> bool variants: method @@ -6126,12 +6373,14 @@ dispatch: CPU, CUDA: dense_to_sparse SparseCsrCPU, SparseCsrCUDA: sparse_compressed_to_sparse + autogen: to_sparse.sparse_dim_out - func: to_sparse(Tensor self) -> Tensor variants: method dispatch: CPU, CUDA: dense_to_sparse SparseCsrCPU, SparseCsrCUDA: sparse_compressed_to_sparse + autogen: to_sparse.out - func: to_sparse_csr(Tensor self) -> Tensor variants: method @@ -6139,6 +6388,7 @@ CPU, CUDA: dense_to_sparse_csr SparseCPU, SparseCUDA: coo_to_sparse_csr SparseCsrCPU, SparseCsrCUDA: sparse_compressed_to_sparse_csr + autogen: to_sparse_csr.out - func: to_sparse_csc(Tensor self) -> Tensor variants: method @@ -6146,6 +6396,7 @@ CPU, CUDA: dense_to_sparse_csc SparseCPU, SparseCUDA: coo_to_sparse_csc SparseCsrCPU, SparseCsrCUDA: sparse_compressed_to_sparse_csc + autogen: to_sparse_csc.out - func: to_sparse_bsr(Tensor self, int[2] blocksize) -> Tensor variants: method @@ -6153,6 +6404,7 @@ CPU, CUDA: dense_to_sparse_bsr SparseCPU, SparseCUDA: coo_to_sparse_bsr SparseCsrCPU, SparseCsrCUDA: sparse_compressed_to_sparse_bsr + autogen: to_sparse_bsr.out - func: to_sparse_bsc(Tensor self, int[2] blocksize) -> Tensor variants: method @@ -6160,23 +6412,27 @@ CPU, CUDA: dense_to_sparse_bsc SparseCPU, SparseCUDA: coo_to_sparse_bsc SparseCsrCPU, SparseCsrCUDA: sparse_compressed_to_sparse_bsc + autogen: to_sparse_bsc.out - func: to_mkldnn(Tensor self, ScalarType? 
dtype=None) -> Tensor variants: method dispatch: CPU: dense_to_mkldnn + autogen: to_mkldnn.out - func: mkldnn_reorder_conv2d_weight(Tensor self, int[2] padding=0, int[2] stride=1, int[2] dilation=1, int groups=1) -> Tensor variants: function python_module: nn dispatch: MkldnnCPU: mkldnn_reorder_conv2d_weight + autogen: mkldnn_reorder_conv2d_weight.out - func: mkldnn_reorder_conv3d_weight(Tensor self, int[3] padding=0, int[3] stride=1, int[3] dilation=1, int groups=1) -> Tensor variants: function python_module: nn dispatch: MkldnnCPU: mkldnn_reorder_conv3d_weight + autogen: mkldnn_reorder_conv3d_weight.out - func: to_mkldnn_backward(Tensor grad, Tensor input) -> Tensor @@ -6184,37 +6440,44 @@ variants: function dispatch: CPU, CUDA: quantize_per_tensor_dynamic + autogen: quantize_per_tensor_dynamic.out - func: quantize_per_tensor(Tensor self, float scale, int zero_point, ScalarType dtype) -> Tensor variants: function dispatch: CPU, CUDA: quantize_per_tensor + autogen: quantize_per_tensor.out - func: quantize_per_tensor.tensor_qparams(Tensor self, Tensor scale, Tensor zero_point, ScalarType dtype) -> Tensor variants: function dispatch: CPU, CUDA: quantize_per_tensor_tensor_qparams + autogen: quantize_per_tensor.tensor_qparams_out - func: quantize_per_tensor.tensors(Tensor[] tensors, Tensor scales, Tensor zero_points, ScalarType dtype) -> Tensor[] variants: function dispatch: CPU: quantize_per_tensor_list_cpu + autogen: quantize_per_tensor.tensors_out - func: quantize_per_channel(Tensor self, Tensor scales, Tensor zero_points, int axis, ScalarType dtype) -> Tensor variants: function dispatch: CPU, CUDA: quantize_per_channel + autogen: quantize_per_channel.out - func: dequantize.self(Tensor self) -> Tensor variants: function, method dispatch: CPU, CUDA: dequantize_cpu_or_cuda QuantizedCPU, QuantizedCUDA: dequantize_quantized + autogen: dequantize.self_out - func: dequantize.tensors(Tensor[] tensors) -> Tensor[] variants: function dispatch: QuantizedCPU: dequantize_tensors_quantized_cpu + autogen: dequantize.tensors_out - func: q_scale(Tensor self) -> float variants: function, method @@ -6230,11 +6493,13 @@ variants: function, method dispatch: QuantizedCPU, QuantizedCUDA: q_per_channel_scales + autogen: q_per_channel_scales.out - func: q_per_channel_zero_points(Tensor self) -> Tensor variants: function, method dispatch: QuantizedCPU, QuantizedCUDA: q_per_channel_zero_points + autogen: q_per_channel_zero_points.out - func: q_per_channel_axis(Tensor self) -> int variants: function, method @@ -6247,16 +6512,19 @@ dispatch: QuantizedCPU: int_repr_quantized_cpu QuantizedCUDA: int_repr_quantized_cuda + autogen: int_repr.out - func: _make_per_tensor_quantized_tensor(Tensor self, float scale, int zero_point) -> Tensor dispatch: CPU: make_per_tensor_quantized_tensor_cpu CUDA: make_per_tensor_quantized_tensor_cuda + autogen: _make_per_tensor_quantized_tensor.out - func: _make_per_channel_quantized_tensor(Tensor self, Tensor scale, Tensor zero_point, int axis) -> Tensor dispatch: CPU: make_per_channel_quantized_tensor_cpu CUDA: make_per_channel_quantized_tensor_cuda + autogen: _make_per_channel_quantized_tensor.out - func: qscheme(Tensor self) -> QScheme variants: method @@ -6275,11 +6543,13 @@ variants: function dispatch: CPU, CUDA: fake_quantize_per_tensor_affine_cachemask + autogen: fake_quantize_per_tensor_affine_cachemask.out - func: _fake_quantize_per_tensor_affine_cachemask_tensor_qparams(Tensor self, Tensor scale, Tensor zero_point, Tensor fake_quant_enabled, int quant_min, int quant_max) -> 
(Tensor output, Tensor mask) variants: function dispatch: CPU, CUDA: _fake_quantize_per_tensor_affine_cachemask_tensor_qparams + autogen: _fake_quantize_per_tensor_affine_cachemask_tensor_qparams.out - func: fake_quantize_per_tensor_affine_cachemask_backward(Tensor grad, Tensor mask) -> Tensor variants: function @@ -6288,6 +6558,7 @@ variants: function dispatch: CPU, CUDA: _fake_quantize_learnable_per_tensor_affine + autogen: _fake_quantize_learnable_per_tensor_affine.out - func: _fake_quantize_learnable_per_tensor_affine_backward(Tensor grad, Tensor self, Tensor scale, Tensor zero_point, int quant_min, int quant_max, float grad_factor=1.0) -> (Tensor, Tensor, Tensor) variants: function @@ -6300,6 +6571,7 @@ variants: function dispatch: CPU, CUDA: fake_quantize_per_channel_affine_cachemask + autogen: fake_quantize_per_channel_affine_cachemask.out - func: fake_quantize_per_channel_affine_cachemask_backward(Tensor grad, Tensor mask) -> Tensor variants: function @@ -6308,6 +6580,7 @@ variants: function dispatch: CPU, CUDA: _fake_quantize_learnable_per_channel_affine + autogen: _fake_quantize_learnable_per_channel_affine.out - func: _fake_quantize_learnable_per_channel_affine_backward(Tensor grad, Tensor self, Tensor scale, Tensor zero_point, int axis, int quant_min, int quant_max, float grad_factor=1.0) -> (Tensor, Tensor, Tensor) variants: function @@ -6343,6 +6616,7 @@ device_guard: False dispatch: CompositeExplicitAutograd: _to_copy + autogen: _to_copy.out # to(Device) must not exist because all constructors of Device also works for # TensorOptions. Otherwise, an ambiguity error is thrown. @@ -6381,6 +6655,7 @@ variants: function - func: item(Tensor self) -> Scalar + tags: data_dependent_output variants: method - func: result_type.Tensor(Tensor tensor, Tensor other) -> ScalarType @@ -6402,6 +6677,7 @@ # NB: Does NOT check precondition that numel == 1 - func: _local_scalar_dense(Tensor self) -> Scalar + tags: data_dependent_output dispatch: CPU: _local_scalar_dense_cpu CUDA: _local_scalar_dense_cuda @@ -6413,16 +6689,19 @@ - func: _lstm_mps(Tensor input, Tensor[] hx, Tensor[] params, bool has_biases, int num_layers, float dropout, bool train, bool bidirectional, bool batch_first) -> (Tensor, Tensor, Tensor, Tensor, Tensor) dispatch: MPS: _lstm_mps + autogen: _lstm_mps.out - func: lstm_mps_backward(Tensor grad_y, Tensor? grad_hy, Tensor? grad_cy, Tensor z_state, Tensor cell_state_fwd, Tensor input, Tensor[] hx, Tensor[] params, bool has_biases, int num_layers, float dropout, bool train, bool bidirectional, bool batch_first) -> (Tensor, Tensor[], Tensor[]) dispatch: MPS: lstm_mps_backward + autogen: lstm_mps_backward.out # Fused RNN kernels - func: _thnn_fused_lstm_cell(Tensor input_gates, Tensor hidden_gates, Tensor cx, Tensor? input_bias=None, Tensor? hidden_bias=None) -> (Tensor, Tensor, Tensor) dispatch: CUDA: _thnn_fused_lstm_cell_cuda + autogen: _thnn_fused_lstm_cell.out # NB: The composite version of this function below is a simple wrapper that duplicates some of the outputs # It is necessary to avoid triggering TensorImpl use count checks in debug mode @@ -6430,6 +6709,7 @@ - func: _thnn_fused_lstm_cell_backward_impl(Tensor? grad_hy, Tensor? grad_cy, Tensor cx, Tensor cy, Tensor workspace, bool has_bias) -> (Tensor, Tensor, Tensor) dispatch: CUDA: _thnn_fused_lstm_cell_backward_impl_cuda + autogen: _thnn_fused_lstm_cell_backward_impl.out - func: _thnn_fused_lstm_cell_backward(Tensor? grad_hy, Tensor? 
grad_cy, Tensor cx, Tensor cy, Tensor workspace, bool has_bias) -> (Tensor, Tensor, Tensor, Tensor, Tensor) @@ -6438,10 +6718,12 @@ - func: _thnn_fused_gru_cell(Tensor input_gates, Tensor hidden_gates, Tensor hx, Tensor? input_bias=None, Tensor? hidden_bias=None) -> (Tensor, Tensor) dispatch: CUDA: _thnn_fused_gru_cell_cuda + autogen: _thnn_fused_gru_cell.out - func: _thnn_fused_gru_cell_backward(Tensor grad_hy, Tensor workspace, bool has_bias) -> (Tensor, Tensor, Tensor, Tensor, Tensor) dispatch: CUDA: _thnn_fused_gru_cell_backward_cuda + autogen: _thnn_fused_gru_cell_backward.out - func: _thnn_differentiable_gru_cell_backward(Tensor grad_hy, Tensor input_gates, Tensor hidden_gates, Tensor hx, Tensor? input_bias, Tensor? hidden_bias) -> (Tensor, Tensor, Tensor, Tensor, Tensor) @@ -6500,6 +6782,7 @@ - func: _pack_padded_sequence(Tensor input, Tensor lengths, bool batch_first) -> (Tensor, Tensor) dispatch: CompositeExplicitAutograd: _pack_padded_sequence + autogen: _pack_padded_sequence.out - func: _pack_padded_sequence_backward(Tensor grad, int[] input_size, Tensor batch_sizes, bool batch_first) -> Tensor @@ -6556,6 +6839,7 @@ - func: lift(Tensor self) -> Tensor dispatch: CompositeExplicitAutograd: lift + autogen: lift.out # lift_fresh is called with an argument that is guaranteed to be # fresh (i.e., newly allocated). This is ONLY called from a @@ -6571,6 +6855,7 @@ tags: view_copy dispatch: CompositeExplicitAutograd: lift_fresh_copy + autogen: lift_fresh_copy.out - func: is_set_to(Tensor self, Tensor tensor) -> bool variants: method @@ -6623,15 +6908,17 @@ dispatch: CompositeExplicitAutograd: masked_scatter -- func: _masked_softmax(Tensor self, Tensor mask, int? dim=None) -> Tensor +- func: _masked_softmax(Tensor self, Tensor mask, int? dim=None, int? mask_type=None) -> Tensor dispatch: CUDA: masked_softmax_cuda CPU: masked_softmax_cpu + autogen: _masked_softmax.out - func: _masked_softmax_backward(Tensor grad_output, Tensor output, Tensor mask, int? dim=None) -> Tensor dispatch: CUDA: masked_softmax_backward_cuda CPU: masked_softmax_backward_cpu + autogen: _masked_softmax_backward.out - func: view.SymInt(Tensor(a) self, SymInt[] size) -> Tensor(a) variants: method @@ -6887,6 +7174,7 @@ variants: function dispatch: CompositeExplicitAutograd: bitwise_and + autogen: bitwise_and.Scalar_Tensor_out - func: bitwise_and.Tensor(Tensor self, Tensor other) -> Tensor device_check: NoCheck # TensorIterator @@ -6941,6 +7229,7 @@ variants: function dispatch: CompositeExplicitAutograd: bitwise_or + autogen: bitwise_or.Scalar_Tensor_out - func: bitwise_or.Tensor(Tensor self, Tensor other) -> Tensor device_check: NoCheck # TensorIterator @@ -6995,6 +7284,7 @@ variants: function dispatch: CompositeExplicitAutograd: bitwise_xor + autogen: bitwise_xor.Scalar_Tensor_out - func: bitwise_xor.Tensor(Tensor self, Tensor other) -> Tensor device_check: NoCheck # TensorIterator @@ -7092,6 +7382,7 @@ variants: function dispatch: CompositeExplicitAutograd: bitwise_left_shift + autogen: bitwise_left_shift.Scalar_Tensor_out - func: __rshift__.Scalar(Tensor self, Scalar other) -> Tensor device_check: NoCheck # TensorIterator @@ -7159,6 +7450,7 @@ variants: function dispatch: CompositeExplicitAutograd: bitwise_right_shift + autogen: bitwise_right_shift.Scalar_Tensor_out - func: tril_(Tensor(a!) self, int diagonal=0) -> Tensor(a!) structured_delegate: tril.out @@ -7203,6 +7495,7 @@ - func: random_.from(Tensor(a!) self, int from, int? to, *, Generator? generator=None) -> Tensor(a!) 
device_check: NoCheck # TensorIterator variants: method + tags: nondeterministic_seeded dispatch: CPU, CUDA: random_ Meta: random_meta_ @@ -7211,6 +7504,7 @@ - func: random_.to(Tensor(a!) self, int to, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: random_ @@ -7220,6 +7514,7 @@ - func: random_(Tensor(a!) self, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: random_ @@ -7228,6 +7523,7 @@ - func: uniform_(Tensor(a!) self, float from=0, float to=1, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: uniform_ @@ -7238,12 +7534,14 @@ - func: cauchy_(Tensor(a!) self, float median=0, float sigma=1, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: method + tags: nondeterministic_seeded dispatch: CPU, CUDA: cauchy_ autogen: cauchy, cauchy.out - func: log_normal_(Tensor(a!) self, float mean=1, float std=2, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: log_normal_ @@ -7251,6 +7549,7 @@ - func: exponential_(Tensor(a!) self, float lambd=1, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: exponential_ @@ -7259,11 +7558,12 @@ - func: geometric_(Tensor(a!) self, float p, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: geometric_ -# wrappers for TH functions + # wrappers for TH functions autogen: geometric, geometric.out - func: diag.out(Tensor self, int diagonal=0, *, Tensor(a!) out) -> Tensor(a!) @@ -7313,17 +7613,20 @@ dispatch: CPU: tril_indices_cpu CUDA: tril_indices_cuda + autogen: tril_indices.out - func: triu_indices(int row, int col, int offset=0, *, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CPU: triu_indices_cpu CUDA: triu_indices_cuda + autogen: triu_indices.out - func: trace(Tensor self) -> Tensor variants: method, function dispatch: CPU: trace_cpu CUDA: trace_cuda + autogen: trace.out - func: trace_backward(Tensor grad, int[] sizes) -> Tensor variants: function @@ -7851,6 +8154,7 @@ dispatch: CPU: _symeig_helper_cpu CUDA: _symeig_helper_cuda + autogen: _symeig_helper.out - func: eig.e(Tensor self, bool eigenvectors=False, *, Tensor(a!) e, Tensor(b!) v) -> (Tensor(a!) eigenvalues, Tensor(b!) eigenvectors) dispatch: @@ -7913,6 +8217,7 @@ dispatch: CPU: _cholesky_solve_helper_cpu CUDA: _cholesky_solve_helper_cuda + autogen: _cholesky_solve_helper.out - func: cholesky_inverse(Tensor self, bool upper=False) -> Tensor variants: method, function @@ -7973,6 +8278,7 @@ # TODO: remove dispatch section when porting TH CUDA to ATen - func: multinomial.out(Tensor self, int num_samples, bool replacement=False, *, Generator? generator=None, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CPU, CUDA: multinomial_out @@ -8115,6 +8421,7 @@ variants: method, function dispatch: CompositeExplicitAutograd: dist + autogen: dist.out - func: atan2.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) 
device_check: NoCheck # TensorIterator @@ -8200,14 +8507,17 @@ - func: _histogramdd_bin_edges(Tensor self, int[] bins, *, float[]? range=None, Tensor? weight=None, bool density=False) -> Tensor[] dispatch: CPU: histogramdd_bin_edges_cpu + autogen: _histogramdd_bin_edges.out - func: _histogramdd_from_bin_cts(Tensor self, int[] bins, *, float[]? range=None, Tensor? weight=None, bool density=False) -> Tensor dispatch: CPU: histogramdd_cpu + autogen: _histogramdd_from_bin_cts.out - func: _histogramdd_from_bin_tensors(Tensor self, Tensor[] bins, *, Tensor? weight=None, bool density=False) -> Tensor dispatch: CPU: histogramdd_cpu + autogen: _histogramdd_from_bin_tensors.out - func: histogramdd(Tensor self, int[] bins, float[]? range=None, Tensor? weight=None, bool density=False) -> (Tensor hist, Tensor[] bin_edges) @@ -8342,6 +8652,7 @@ variants: function dispatch: CPU, CUDA: remainder + autogen: remainder.Scalar_Tensor_out - func: min(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -8351,6 +8662,13 @@ MPS: min_mps QuantizedCPU: min_quantized_cpu +# Not to be confused with binary op `min.out`. Commented because of failed CI +# FIXME: enable this +#- func: min.unary_out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) +# device_check: NoCheck # TensorIterator +# dispatch: +# CompositeExplicitAutograd: min_unary_out + - func: fmin(Tensor self, Tensor other) -> Tensor structured_delegate: fmin.out device_check: NoCheck # TensorIterator @@ -8371,6 +8689,13 @@ MPS: max_mps QuantizedCPU: max_quantized_cpu +# Not to be confused with binary op `max.out`. Commented because of failed CI +# FIXME: enable this +#- func: max.unary_out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) +# device_check: NoCheck # TensorIterator +# dispatch: +# CompositeExplicitAutograd: max_unary_out + - func: fmax(Tensor self, Tensor other) -> Tensor structured_delegate: fmax.out device_check: NoCheck # TensorIterator @@ -8493,6 +8818,7 @@ variants: method, function dispatch: CPU, CUDA: argsort_stable + autogen: argsort.stable_out - func: argsort.dimname(Tensor self, Dimname dim, bool descending=False) -> Tensor variants: method, function @@ -8564,8 +8890,10 @@ variants: function dispatch: CPU, CUDA: unfold_backward + autogen: unfold_backward.out - func: equal(Tensor self, Tensor other) -> bool + tags: data_dependent_output variants: method, function dispatch: CPU: cpu_equal @@ -8644,6 +8972,7 @@ - func: normal_(Tensor(a!) self, float mean=0, float std=1, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: normal_ @@ -8657,10 +8986,12 @@ # but we can't due to overload ambiguity with normal.Tensor_float. - func: normal_functional(Tensor self, float mean=0, float std=1, *, Generator? generator=None) -> Tensor device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: normal_functional - func: normal.Tensor_float_out(Tensor mean, float std=1, *, Generator? generator=None, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CPU, CUDA: normal_out MPS: normal_mps_out @@ -8678,6 +9009,7 @@ CPU, CUDA: normal_out Meta: normal_out_meta MPS: normal_mps_out + tags: nondeterministic_seeded - func: normal.float_Tensor(float mean, Tensor std, *, Generator? 
generator=None) -> Tensor dispatch: @@ -8691,6 +9023,7 @@ CPU, CUDA: normal_out Meta: normal_out_meta MPS: normal_mps_out + tags: nondeterministic_seeded - func: normal.Tensor_Tensor(Tensor mean, Tensor std, *, Generator? generator=None) -> Tensor dispatch: @@ -8702,10 +9035,12 @@ - func: normal.float_float(float mean, float std, int[] size, *, Generator? generator=None, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: normal + tags: nondeterministic_seeded - func: normal.float_float_out(float mean, float std, int[] size, *, Generator? generator=None, Tensor(a!) out) -> Tensor(a!) dispatch: CompositeExplicitAutograd: normal_out + tags: nondeterministic_seeded - func: alias(Tensor(a) self) -> Tensor(a) variants: method, function @@ -8724,18 +9059,18 @@ CUDA: _amp_update_scale_cuda_ autogen: _amp_update_scale, _amp_update_scale.out -#- func: _cat(Tensor[] tensors, int dim=0) -> Tensor - #dispatch: + #- func: _cat(Tensor[] tensors, int dim=0) -> Tensor + #dispatch: #CPU: _cat_cpu #CUDA: cat_cuda #MPS: cat_mps #QuantizedCPU: cat_quantized_cpu -#- func: _cat.out(Tensor[] tensors, int dim=0, *, Tensor(a!) out) -> Tensor(a!) - #dispatch: + #- func: _cat.out(Tensor[] tensors, int dim=0, *, Tensor(a!) out) -> Tensor(a!) + #dispatch: #CPU: _cat_out_cpu - #CUDA: cat_out_cuda - #QuantizedCPU: cat_out_quantized_cpu + #CUDA: cat_out_cuda + #QuantizedCPU: cat_out_quantized_cpu - func: _foreach_add.Scalar(Tensor[] self, Scalar scalar) -> Tensor[] device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices @@ -9412,6 +9747,14 @@ CPU: foreach_tensor_maximum_slow CUDA: foreach_tensor_maximum_cuda +- func: _foreach_maximum_.List(Tensor(a!)[] self, Tensor[] other) -> () + device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices + variants: function + dispatch: + CPU: foreach_tensor_maximum_slow_ + CUDA: foreach_tensor_maximum_cuda_ + autogen: _foreach_maximum.List_out + - func: _foreach_minimum.List(Tensor[] self, Tensor[] other) -> Tensor[] device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices variants: function @@ -9419,12 +9762,21 @@ CPU: foreach_tensor_minimum_slow CUDA: foreach_tensor_minimum_cuda +- func: _foreach_minimum_.List(Tensor(a!)[] self, Tensor[] other) -> () + device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices + variants: function + dispatch: + CPU: foreach_tensor_minimum_slow_ + CUDA: foreach_tensor_minimum_cuda_ + autogen: _foreach_minimum.List_out + - func: _foreach_norm.Scalar(Tensor[] self, Scalar ord=2) -> Tensor[] device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices variants: function dispatch: CPU: foreach_tensor_norm_slow CUDA: foreach_tensor_norm_cuda + autogen: _foreach_norm.Scalar_out - func: bucketize.Tensor(Tensor self, Tensor boundaries, *, bool out_int32=False, bool right=False) -> Tensor dispatch: @@ -9440,6 +9792,7 @@ dispatch: CPU: bucketize_cpu CUDA: bucketize_cuda + autogen: bucketize.Scalar_out - func: searchsorted.Tensor(Tensor sorted_sequence, Tensor self, *, bool out_int32=False, bool right=False, str? side=None, Tensor? 
sorter=None) -> Tensor dispatch: @@ -9455,6 +9808,7 @@ - func: _torch_cuda_cu_linker_symbol_op(Tensor self) -> Tensor dispatch: CUDA: _torch_cuda_cu_linker_symbol_op_cuda + autogen: _torch_cuda_cu_linker_symbol_op.out - func: searchsorted.Tensor_out(Tensor sorted_sequence, Tensor self, *, bool out_int32=False, bool right=False, str? side=None, Tensor? sorter=None, Tensor(a!) out) -> Tensor(a!) dispatch: @@ -9465,6 +9819,7 @@ dispatch: CPU: searchsorted_cpu CUDA: searchsorted_cuda + autogen: searchsorted.Scalar_out - func: _convert_indices_from_coo_to_csr(Tensor self, int size, *, bool out_int32=False) -> Tensor structured_delegate: _convert_indices_from_coo_to_csr.out @@ -9767,11 +10122,13 @@ python_module: nn dispatch: CPU, CUDA: glu_jvp + autogen: glu_jvp.out - func: glu_backward_jvp(Tensor grad_x, Tensor grad_glu, Tensor x, Tensor dgrad_glu, Tensor dx, int dim) -> Tensor python_module: nn dispatch: CPU, CUDA: glu_backward_jvp + autogen: glu_backward_jvp.out - func: hardsigmoid.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -9860,6 +10217,7 @@ python_module: nn dispatch: CPU, CUDA: hardswish_backward + autogen: hardswish_backward.out - func: leaky_relu.out(Tensor self, Scalar negative_slope=0.01, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -9933,6 +10291,7 @@ - func: rrelu_with_noise.out(Tensor self, Tensor noise, Scalar lower=0.125, Scalar upper=0.3333333333333333, bool training=False, Generator? generator=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn + tags: nondeterministic_seeded dispatch: CPU: rrelu_with_noise_out_cpu CUDA: rrelu_with_noise_out_cuda @@ -9948,9 +10307,11 @@ python_module: nn dispatch: CompositeExplicitAutograd: rrelu_with_noise_backward + autogen: rrelu_with_noise_backward.out - func: rrelu_with_noise_(Tensor(a!) self, Tensor noise, Scalar lower=0.125, Scalar upper=0.3333333333333333, bool training=False, Generator? generator=None) -> Tensor(a!) python_module: nn + tags: nondeterministic_seeded dispatch: CPU: rrelu_with_noise_cpu_ CUDA: rrelu_with_noise_cuda_ @@ -10011,7 +10372,7 @@ CPU: adaptive_avg_pool2d_out_cpu CUDA: adaptive_avg_pool2d_out_cuda MPS: adaptive_avg_pool2d_out_mps - MkldnnCPU: mkldnn_adaptive_avg_pool2d_out + MkldnnCPU: mkldnn_adaptive_avg_pool2d_out_stub - func: adaptive_avg_pool2d(Tensor self, int[2] output_size) -> Tensor python_module: nn @@ -10020,9 +10381,14 @@ dispatch: MkldnnCPU: mkldnn_adaptive_avg_pool2d +- func: mkldnn_adaptive_avg_pool2d.out(Tensor self, int[2] output_size, *, Tensor(a!) out) -> Tensor(a!) + dispatch: + MkldnnCPU: mkldnn_adaptive_avg_pool2d_out + - func: mkldnn_adaptive_avg_pool2d_backward(Tensor grad_output, Tensor self) -> Tensor dispatch: MkldnnCPU: mkldnn_adaptive_avg_pool2d_backward + autogen: mkldnn_adaptive_avg_pool2d_backward.out - func: _adaptive_avg_pool2d(Tensor self, int[2] output_size) -> Tensor dispatch: @@ -10031,6 +10397,7 @@ MPS: adaptive_avg_pool2d_mps QuantizedCPU: adaptive_avg_pool2d_quantized_cpu QuantizedCUDA: adaptive_avg_pool2d_quantized_cuda + autogen: _adaptive_avg_pool2d.out - func: _adaptive_avg_pool2d_backward(Tensor grad_output, Tensor self) -> Tensor python_module: nn @@ -10038,6 +10405,7 @@ CPU: adaptive_avg_pool2d_backward_cpu CUDA: adaptive_avg_pool2d_backward_cuda MPS: adaptive_avg_pool2d_backward_mps + autogen: _adaptive_avg_pool2d_backward.out - func: adaptive_avg_pool3d.out(Tensor self, int[3] output_size, *, Tensor(a!) out) -> Tensor(a!) 
python_module: nn @@ -10054,6 +10422,7 @@ CPU: adaptive_avg_pool3d_cpu CUDA: adaptive_avg_pool3d_cuda QuantizedCPU: adaptive_avg_pool3d_quantized_cpu + autogen: _adaptive_avg_pool3d.out - func: adaptive_avg_pool3d_backward.grad_input(Tensor grad_output, Tensor self, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn @@ -10066,6 +10435,7 @@ dispatch: CPU: adaptive_avg_pool3d_backward_cpu CUDA: adaptive_avg_pool3d_backward_cuda + autogen: _adaptive_avg_pool3d_backward.out # Return: (Tensor output, Tensor indices) - func: adaptive_max_pool2d.out(Tensor self, int[2] output_size, *, Tensor(a!) out, Tensor(b!) indices) -> (Tensor(a!), Tensor(b!)) @@ -10477,101 +10847,121 @@ python_module: nn dispatch: CompositeExplicitAutograd: upsample_linear1d + autogen: upsample_linear1d.vec_out - func: upsample_linear1d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_linear1d_backward + autogen: upsample_linear1d_backward.vec_out - func: upsample_bilinear2d.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_bilinear2d + autogen: upsample_bilinear2d.vec_out - func: upsample_bilinear2d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_bilinear2d_backward + autogen: upsample_bilinear2d_backward.vec_out - func: _upsample_bilinear2d_aa.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_bilinear2d_aa + autogen: _upsample_bilinear2d_aa.vec_out - func: _upsample_bilinear2d_aa_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_bilinear2d_aa_backward + autogen: _upsample_bilinear2d_aa_backward.vec_out - func: upsample_trilinear3d.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_trilinear3d + autogen: upsample_trilinear3d.vec_out - func: upsample_trilinear3d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_trilinear3d_backward + autogen: upsample_trilinear3d_backward.vec_out - func: upsample_bicubic2d.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_bicubic2d + autogen: upsample_bicubic2d.vec_out - func: upsample_bicubic2d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_bicubic2d_backward + autogen: upsample_bicubic2d_backward.vec_out - func: _upsample_bicubic2d_aa.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_bicubic2d_aa + autogen: _upsample_bicubic2d_aa.vec_out - func: _upsample_bicubic2d_aa_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? 
scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_bicubic2d_aa_backward + autogen: _upsample_bicubic2d_aa_backward.vec_out - func: upsample_nearest1d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_nearest1d + autogen: upsample_nearest1d.vec_out - func: _upsample_nearest_exact1d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_nearest_exact1d + autogen: _upsample_nearest_exact1d.vec_out - func: upsample_nearest1d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_nearest1d_backward + autogen: upsample_nearest1d_backward.vec_out - func: _upsample_nearest_exact1d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_nearest_exact1d_backward + autogen: _upsample_nearest_exact1d_backward.vec_out - func: upsample_nearest2d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_nearest2d + autogen: upsample_nearest2d.vec_out - func: _upsample_nearest_exact2d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_nearest_exact2d + autogen: _upsample_nearest_exact2d.vec_out - func: upsample_nearest2d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: upsample_nearest2d_backward + autogen: upsample_nearest2d_backward.vec_out - func: _upsample_nearest_exact2d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CompositeExplicitAutograd: _upsample_nearest_exact2d_backward + autogen: _upsample_nearest_exact2d_backward.vec_out - func: upsample_nearest3d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor python_module: nn @@ -10579,6 +10969,7 @@ CPU: upsample_nearest3d_cpu CUDA: upsample_nearest3d_cuda QuantizedCPU: upsample_nearest3d_quantized_cpu + autogen: upsample_nearest3d.vec_out - func: _upsample_nearest_exact3d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor python_module: nn @@ -10586,18 +10977,21 @@ CPU: _upsample_nearest_exact3d_cpu CUDA: _upsample_nearest_exact3d_cuda QuantizedCPU: _upsample_nearest_exact3d_quantized_cpu + autogen: _upsample_nearest_exact3d.vec_out - func: upsample_nearest3d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CPU: upsample_nearest3d_backward_cpu CUDA: upsample_nearest3d_backward_cuda + autogen: upsample_nearest3d_backward.vec_out - func: _upsample_nearest_exact3d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor python_module: nn dispatch: CPU: _upsample_nearest_exact3d_backward_cpu CUDA: _upsample_nearest_exact3d_backward_cuda + autogen: _upsample_nearest_exact3d_backward.vec_out # NOTE: all of the non-"vec" upsample overloads are only kept for backward compatibility. - func: upsample_linear1d.out(Tensor self, int[1] output_size, bool align_corners, float? scales=None, *, Tensor(a!) 
out) -> Tensor(a!) @@ -10986,6 +11380,7 @@ dispatch: CPU: slow_conv2d_backward_cpu CUDA: slow_conv2d_backward_cuda + autogen: _slow_conv2d_backward.output_mask_out - func: _conv_depthwise2d.out(Tensor self, Tensor weight, int[2] kernel_size, Tensor? bias, int[2] stride, int[2] padding, int[2] dilation, *, Tensor(a!) out) -> Tensor(a!) use_const_ref_for_mutable_tensors: True @@ -11002,6 +11397,7 @@ python_module: nn dispatch: CUDA: conv_depthwise3d_cuda + autogen: conv_depthwise3d.out - func: slow_conv3d.out(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias=None, int[3] stride=1, int[3] padding=0, *, Tensor(a!) out) -> Tensor(a!) python_module: nn @@ -11024,12 +11420,14 @@ dispatch: CPU: slow_conv_dilated2d_cpu CUDA: slow_conv_dilated2d_cuda + autogen: slow_conv_dilated2d.out - func: slow_conv_dilated3d(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias=None, int[3] stride=1, int[3] padding=0, int[3] dilation=1) -> Tensor python_module: nn dispatch: CPU: slow_conv_dilated3d_cpu CUDA: slow_conv_dilated3d_cuda + autogen: slow_conv_dilated3d.out - func: col2im.out(Tensor self, int[2] output_size, int[2] kernel_size, int[2] dilation, int[2] padding, int[2] stride, *, Tensor(a!) out) -> Tensor(a!) python_module: nn @@ -11097,6 +11495,7 @@ SparseCPU, SparseCUDA: isinf_sparse SparseMeta: isinf_sparse_meta SparseCsrCPU, SparseCsrCUDA: isinf_sparse_csr + autogen: isinf.out - func: record_stream(Tensor(a!) self, Stream s) -> () variants: method @@ -11748,8 +12147,6 @@ - func: linalg_cross.out(Tensor self, Tensor other, *, int dim=-1, Tensor(a!) out) -> Tensor(a!) python_module: linalg structured: True - precomputed: - - dim -> int dim dispatch: CPU, CUDA: linalg_cross_out @@ -11886,6 +12283,7 @@ variants: function dispatch: CPU, CUDA: linalg_matrix_exp + autogen: linalg_matrix_exp.out - func: _linalg_slogdet(Tensor A) -> (Tensor sign, Tensor logabsdet, Tensor LU, Tensor pivots) structured_delegate: _linalg_slogdet.sign @@ -11960,34 +12358,26 @@ dispatch: CPU, CUDA: linalg_householder_product_out -- func: _linalg_inv_out_helper_(Tensor(a!) self, Tensor(b!) infos_lu, Tensor(c!) infos_getri) -> Tensor(a!) - variants: function - dispatch: - CPU: _linalg_inv_out_helper_cpu - CUDA: _linalg_inv_out_helper_cuda - autogen: _linalg_inv_out_helper, _linalg_inv_out_helper.out - -- func: linalg_inv_ex(Tensor self, *, bool check_errors=False) -> (Tensor inverse, Tensor info) +- func: linalg_inv_ex(Tensor A, *, bool check_errors=False) -> (Tensor inverse, Tensor info) python_module: linalg - variants: function - dispatch: - # calls transpose_ - CompositeExplicitAutogradNonFunctional: linalg_inv_ex + structured_delegate: linalg_inv_ex.inverse -- func: linalg_inv_ex.inverse(Tensor self, *, bool check_errors=False, Tensor(a!) inverse, Tensor(b!) info) -> (Tensor(a!) inverse, Tensor(b!) info) +- func: linalg_inv_ex.inverse(Tensor A, *, bool check_errors=False, Tensor(a!) inverse, Tensor(b!) info) -> (Tensor(a!) inverse, Tensor(b!) info) python_module: linalg - variants: function + structured: True dispatch: - # calls transpose_ - CompositeExplicitAutogradNonFunctional: linalg_inv_ex_out + CPU, CUDA: linalg_inv_ex_out -- func: linalg_inv(Tensor self) -> Tensor +- func: linalg_inv(Tensor A) -> Tensor python_module: linalg - variants: function -- func: linalg_inv.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) +- func: linalg_inv.out(Tensor A, *, Tensor(a!) out) -> Tensor(a!) 
python_module: linalg - variants: function + +- func: inverse(Tensor self) -> Tensor + variants: function, method + +- func: inverse.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) - func: inner(Tensor self, Tensor other) -> Tensor variants: function, method @@ -12229,18 +12619,21 @@ python_module: nn dispatch: CPU: _test_optional_intlist + autogen: _test_optional_intlist.out # Note: this function is only for testing. - func: _test_optional_filled_intlist(Tensor values, int[2]? addends) -> Tensor python_module: nn dispatch: CPU: _test_optional_intlist + autogen: _test_optional_filled_intlist.out # Note: this function is only for testing. - func: _test_optional_floatlist(Tensor values, float[]? addends) -> Tensor python_module: nn dispatch: CPU: _test_optional_floatlist + autogen: _test_optional_floatlist.out # Note: this function is only for testing. - func: _test_string_default(Tensor dummy, str a="\"'\\", str b='"\'\\') -> Tensor @@ -12260,16 +12653,45 @@ python_module: nn dispatch: CompositeExplicitAutograd: _test_warn_in_autograd + autogen: _test_warn_in_autograd.out + +# Note: this function is only for testing. +- func: _test_autograd_multiple_dispatch.fullcoverage(Tensor self) -> Tensor + dispatch: + # the NestedTensor keys are necessary because NestedTensor has been removed + # from the CompositeExplicitAutograd keyset see Note [NestedTensor Not Included in Backend Keys] + CompositeExplicitAutograd, NestedTensorCPU, NestedTensorCUDA: _test_autograd_multiple_dispatch_fullcoverage + autogen: _test_autograd_multiple_dispatch.fullcoverage_out + +# Note: this function is only for testing. +- func: _test_autograd_multiple_dispatch.ntonly(Tensor self, bool b) -> Tensor + dispatch: + CompositeImplicitAutograd, NestedTensorCPU, NestedTensorCUDA: _test_autograd_multiple_dispatch_ntonly + +# Note: this function is only for testing. +- func: _test_autograd_multiple_dispatch_view(Tensor(a) self) -> Tensor(a) + dispatch: + CompositeExplicitAutograd: _test_autograd_multiple_dispatch_view + +# Note: this function is only for testing. +- func: _test_autograd_multiple_dispatch_view_copy(Tensor self) -> Tensor + variants: function + dispatch: + CompositeExplicitAutogradNonFunctional: _test_autograd_multiple_dispatch_view_copy + tags: view_copy + autogen: _test_autograd_multiple_dispatch_view_copy.out - func: segment_reduce(Tensor data, str reduce, *, Tensor? lengths=None, Tensor? indices=None, Tensor? offsets=None, int axis=0, bool unsafe=False, Scalar? initial=None) -> Tensor variants: function dispatch: CPU, CUDA: segment_reduce_kernel + autogen: segment_reduce.out - func: _segment_reduce_backward(Tensor grad, Tensor output, Tensor data, str reduce, *, Tensor? lengths=None, Tensor? offsets=None, int axis=0, Scalar? 
initial=None) -> Tensor variants: function dispatch: CPU, CUDA: _segment_reduce_backward_kernel + autogen: _segment_reduce_backward.out - func: pad_sequence(Tensor[] sequences, bool batch_first=False, float padding_value=0.0) -> Tensor python_module: nn @@ -12287,6 +12709,7 @@ variants: function dispatch: CompositeExplicitAutograd: nested_tensor + autogen: nested_tensor.out - func: _fw_primal_copy(Tensor self, int level) -> Tensor variants: function @@ -12467,12 +12890,14 @@ dispatch: CompositeExplicitAutograd: ccol_indices_copy tags: view_copy + autogen: ccol_indices_copy.out - func: row_indices_copy(Tensor self) -> Tensor variants: function dispatch: CompositeExplicitAutograd: row_indices_copy tags: view_copy + autogen: row_indices_copy.out - func: unbind_copy.int(Tensor self, int dim=0) -> Tensor[] variants: function @@ -12545,6 +12970,7 @@ dispatch: CompositeExplicitAutograd: view_copy_SymInt tags: view_copy + autogen: view_copy.SymInt_out - func: as_strided_copy.out(Tensor self, int[] size, int[] stride, int? storage_offset=None, *, Tensor(a!) out) -> Tensor(a!) @@ -12719,26 +13145,47 @@ dispatch: NestedTensorCPU: NestedTensor_to_padded_tensor_generic NestedTensorCUDA: NestedTensor_to_padded_tensor_cuda + autogen: to_padded_tensor.out + +- func: _nested_tensor_softmax_with_shape(Tensor self, Tensor query) -> Tensor + dispatch: + NestedTensorCPU: NestedTensor_softmax_dropout + NestedTensorCUDA: NestedTensor_softmax_dropout_cuda - func: _nested_tensor_layer_norm(Tensor self, Tensor? weight, Tensor? bias, float eps) -> Tensor variants: method dispatch: NestedTensorCPU, NestedTensorCUDA: NestedTensor_layer_norm + autogen: _nested_tensor_layer_norm.out # Apparently, putting "forward" in the name will cause Python bindings to be skipped, so "fwd" it is. -- func: _transformer_encoder_layer_fwd(Tensor src, int embed_dim, int num_heads, Tensor qkv_weight, Tensor qkv_bias, Tensor proj_weight, Tensor proj_bias, bool use_gelu, bool norm_first, float eps, Tensor norm_weight_1, Tensor norm_bias_1, Tensor norm_weight_2, Tensor norm_bias_2, Tensor ffn_weight_1, Tensor ffn_bias_1, Tensor ffn_weight_2, Tensor ffn_bias_2, Tensor? mask=None) -> Tensor +- func: _transformer_encoder_layer_fwd(Tensor src, int embed_dim, int num_heads, Tensor qkv_weight, Tensor qkv_bias, Tensor proj_weight, Tensor proj_bias, bool use_gelu, bool norm_first, float eps, Tensor norm_weight_1, Tensor norm_bias_1, Tensor norm_weight_2, Tensor norm_bias_2, Tensor ffn_weight_1, Tensor ffn_bias_1, Tensor ffn_weight_2, Tensor ffn_bias_2, Tensor? mask=None, int? mask_type=None) -> Tensor variants: function dispatch: CPU, CUDA, NestedTensorCPU, NestedTensorCUDA: transformer_encoder_layer_forward + autogen: _transformer_encoder_layer_fwd.out -- func: _native_multi_head_attention(Tensor query, Tensor key, Tensor value, int embed_dim, int num_head, Tensor qkv_weight, Tensor qkv_bias, Tensor proj_weight, Tensor proj_bias, Tensor? mask=None, bool need_weights=True, bool average_attn_weights=True) -> (Tensor, Tensor) +- func: _native_multi_head_attention(Tensor query, Tensor key, Tensor value, int embed_dim, int num_head, Tensor qkv_weight, Tensor qkv_bias, Tensor proj_weight, Tensor proj_bias, Tensor? mask=None, bool need_weights=True, bool average_attn_weights=True, int? mask_type=None) -> (Tensor, Tensor) variants: function dispatch: CPU, CUDA, NestedTensorCPU, NestedTensorCUDA: native_multi_head_attention + autogen: _native_multi_head_attention.out - func: _scaled_dot_product_attention(Tensor query, Tensor key, Tensor value, Tensor? 
attn_mask=None, float dropout_p=0.0, bool need_attn_weights=True, bool is_causal=False) -> (Tensor, Tensor) variants: function +- func: _triton_scaled_dot_attention(Tensor q, Tensor k, Tensor v, float dropout_p=0.0) -> Tensor + variants: function + dispatch: + CUDA: triton_scaled_dot_attention + autogen: _triton_scaled_dot_attention.out + +- func: _triton_multi_head_attention(Tensor query, Tensor key, Tensor value, int embed_dim, int num_head, Tensor qkv_weight, Tensor qkv_bias, Tensor proj_weight, Tensor proj_bias, Tensor? mask=None) -> Tensor + variants: function + dispatch: + CUDA: triton_multi_head_attention + autogen: _triton_multi_head_attention.out + - func: special_airy_ai(Tensor x) -> Tensor python_module: special structured_delegate: special_airy_ai.out @@ -12756,11 +13203,13 @@ variants: function dispatch: CPU, CUDA, NestedTensorCPU, NestedTensorCUDA: transformer_decoder_only_layer_forward + autogen: _transformer_decoder_only_layer_fwd.out - func: _native_decoder_only_multi_head_attention(Tensor query, Tensor key, Tensor value, int embed_dim, int num_head, Tensor qkv_weight, Tensor qkv_bias, Tensor proj_weight, Tensor proj_bias, Tensor? mask=None, Tensor? incr_key=None, Tensor? incr_value=None, bool need_weights=True, bool average_attn_weights=True) -> (Tensor, Tensor, Tensor, Tensor) variants: function dispatch: CPU, CUDA, NestedTensorCPU, NestedTensorCUDA: native_decoder_only_multi_head_attention + autogen: _native_decoder_only_multi_head_attention.out - func: special_bessel_j0(Tensor self) -> Tensor python_module: special @@ -13354,3 +13803,4 @@ - func: _foobar(Tensor self, bool arg1=True, bool arg2=True, *, bool arg3=True) -> Tensor dispatch: CPU: foobar + autogen: _foobar.out diff --git a/aten/src/ATen/native/nested/NestedTensorBackward.cpp b/aten/src/ATen/native/nested/NestedTensorBackward.cpp index 600db24c03aa2..ec96fdfaf4c02 100644 --- a/aten/src/ATen/native/nested/NestedTensorBackward.cpp +++ b/aten/src/ATen/native/nested/NestedTensorBackward.cpp @@ -13,6 +13,25 @@ namespace at { namespace native { +// See Note [nested tensor matmul] in NestedTensorMath.cpp +std::tuple matmul_backward_nested( + const Tensor& grad, + const Tensor& self, + const Tensor& other, + std::array grad_input_mask) { + if (!grad.defined()) { + return std::make_tuple(Tensor(), Tensor()); + } + Tensor grad_self, grad_other; + if (grad_input_mask[0]) { + grad_self = at::matmul(grad, other.transpose(-1, -2)); + } + if (grad_input_mask[1]) { + grad_other = at::matmul(self.transpose(-1, -2), grad); + } + return std::make_tuple(grad_self, grad_other); +} + std::tuple nested_linear_backward( const Tensor& input, const Tensor& grad_output, @@ -64,5 +83,92 @@ Tensor _reshape_nested_backward(const Tensor& self, const Tensor& grad) { return grad.reshape(sizes); } +Tensor nested_softmax_backward( + const Tensor& grad, + const Tensor& output, + int64_t dim, + ScalarType input_dtype) { + TORCH_INTERNAL_ASSERT(grad.is_nested(), "Should be nested grad") + TORCH_INTERNAL_ASSERT(output.is_nested(), "Should be nested output") + + auto output_ptr = get_nested_tensor_impl(output); + auto grad_ptr = get_nested_tensor_impl(grad); + int64_t ntensors = output_ptr->size(0); + if (ntensors == 0) { + return grad.clone(); + } + int64_t positive_dim = at::maybe_wrap_dim(dim, output_ptr->dim()); + + // Get the info about the output + const Tensor &output_buffer = output_ptr->get_buffer(), + &output_sizemat = output_ptr->get_nested_size_tensor(); + + // Get the info about the grad + const Tensor &grad_sizemat = 
grad_ptr->get_nested_size_tensor(); + + TORCH_INTERNAL_ASSERT(output_sizemat.equal(grad_sizemat)); + Tensor grad_output = + wrap_buffer(at::empty_like(output_buffer), output_sizemat.clone()); + + // Unbind nt into individual tensor slices for calculating the derivative + std::vector grad_output_unbind{grad_output.unbind()}, + grad_unbind{grad.unbind()}, output_unbind{output.unbind()}; + + for(const auto i: c10::irange(ntensors)) { + at::_softmax_backward_data_out( + grad_output_unbind[i], + grad_unbind[i], + output_unbind[i], + positive_dim - 1, + input_dtype); + } + return grad_output; + +} + +// Rudimentary sum backward assuming the conditions in #82387 +Tensor _nested_sum_backward_cpu( + const Tensor& grad, + const Tensor& nested_self, + OptionalIntArrayRef opt_dims, + bool keepdim) { + auto nt_self = get_nested_tensor_impl(nested_self); + auto nt_grad = get_nested_tensor_impl(grad); + const Tensor& grad_buffer = nt_grad->get_buffer(); + const Tensor& self_buffer = nt_self->get_buffer(); + auto grad_sizes = nt_grad->get_nested_size_tensor(); + auto self_sizes = nt_self->get_nested_size_tensor(); + int64_t ntensors = nt_self->size(0); + const Tensor& self_grad_buffer = self_buffer.new_empty(self_buffer.sizes()); + + auto num_segments = at::prod(grad_sizes, -1); + auto segment_lengths = self_sizes.select(1, -1); + + // This logic assumes for now that + // (1) all the gradient nested tensors are contiguous + // (2) the gradient nested tensors are stored contiguously in the buffer + AT_DISPATCH_ALL_TYPES_AND2( + ScalarType::Half, ScalarType::BFloat16, self_grad_buffer.scalar_type(), "nested_sum_dim_cpu", [&]() { + auto* self_grad_data = self_grad_buffer.data_ptr(); + const auto* output_grad_data = grad_buffer.data_ptr(); + int64_t out_idx = 0, in_idx = 0; + for (const auto i : c10::irange(ntensors)) { + int64_t segments = num_segments[i].item(); + int64_t segment_length = segment_lengths[i].item(); + for (auto j = 0; j < segments; j++) { + scalar_t output_grad = output_grad_data[out_idx]; + for (auto k = 0; k < segment_length; k++) { + self_grad_data[in_idx] = output_grad; + in_idx += 1; + } + out_idx += 1; + } + } + }); + + return wrap_buffer(self_grad_buffer, self_sizes); + +} + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/nested/NestedTensorMath.cpp b/aten/src/ATen/native/nested/NestedTensorMath.cpp index 6c05986e2e61f..d819bceadbb36 100644 --- a/aten/src/ATen/native/nested/NestedTensorMath.cpp +++ b/aten/src/ATen/native/nested/NestedTensorMath.cpp @@ -96,10 +96,6 @@ Tensor pad_tensor_to_shape( } } // namespace -inline const at::Tensor& get_buffer(const at::Tensor& tensor) { - return get_nested_tensor_impl(tensor)->get_buffer(); -} - std::vector NestedTensor_unbind( const at::Tensor& self, int64_t dim) { @@ -119,14 +115,15 @@ std::vector NestedTensor_unbind( std::vector sizes = NestedTensor_get_sizes(self_ptr), strides = NestedTensor_get_strides(self_ptr); const std::vector& offsets = self_ptr->get_offsets(); - for (int64_t i = 0; i < ntensors; i++) { + for (const int64_t i: c10::irange(ntensors)){ result_tensors[i] = buffer.as_strided(sizes[i], strides[i], offsets[i]); } return result_tensors; } Tensor& NestedTensor_relu_(Tensor& self) { - at::relu_(const_cast(get_nested_tensor_impl(self)->get_buffer())); + auto buffer = get_nested_tensor_impl(self)->get_buffer(); + at::relu_(buffer); return self; } @@ -135,7 +132,8 @@ Tensor NestedTensor_relu(const Tensor& self) { } Tensor& NestedTensor_gelu_(Tensor& self, c10::string_view approximate) { - 
at::gelu_(const_cast(get_nested_tensor_impl(self)->get_buffer()), approximate); + auto buffer = get_nested_tensor_impl(self)->get_buffer(); + at::gelu_(buffer, approximate); return self; } @@ -147,7 +145,7 @@ Tensor NestedTensor_gelu(const Tensor& self, c10::string_view approximate) { }); } -Tensor NestedTensor_nested_tensor_from_mask(const Tensor& t, const Tensor& mask) { +Tensor NestedTensor_nested_tensor_from_mask(const Tensor& t, const Tensor& mask, bool mask_check) { TORCH_CHECK(mask.scalar_type() == at::ScalarType::Bool, "Expected mask to be of ScalarType Bool, but got ", mask.scalar_type(), " instead."); TORCH_CHECK(mask.dim() == 2, "Padding mask should be 2D"); TORCH_CHECK(t.dim() == 3, "Input should be a 3D tensor, N * L * D"); @@ -165,7 +163,8 @@ Tensor NestedTensor_nested_tensor_from_mask(const Tensor& t, const Tensor& mask) sizes = sizes.cumsum(1).select(1, L - 1); nums = nums.to(sizes.options()); - TORCH_CHECK(sizes.equal(nums), "Mask must be left-aligned without gaps"); + if (mask_check) + TORCH_CHECK(sizes.equal(nums), "Mask must be left-aligned without gaps"); sizes = sizes.reshape({N, 1}); // N, ([d1=D, d2=D, ... dN=D]) @@ -706,22 +705,25 @@ at::Tensor NestedTensor_get_nested_size_tensor(const at::Tensor& self){ return get_nested_size_tensor(self); } -Tensor dropout_nested(const Tensor& input, double p, bool train) { +std::tuple native_dropout_nested(const Tensor& input, double p, c10::optional train) { auto input_ptr = get_nested_tensor_impl(input); const Tensor& input_buffer = input_ptr->get_buffer(), & sizemat = input_ptr->get_nested_size_tensor(), & stridemat = input_ptr->get_nested_stride_tensor(); const std::vector& offsets = input_ptr->get_offsets(); - Tensor output_buffer = at::dropout(input_buffer, p, train); + Tensor output_buffer, mask_buffer; + if (input_buffer.numel() == 0) { + output_buffer = input_buffer.clone(); + mask_buffer = input_buffer.clone(); + } + else { + std::tie(output_buffer, mask_buffer) = at::native_dropout(input_buffer, p, train); + } // regular tensor dropout reuses input size and stride // i.e. if input is not contiguous, then output is also discontiguous - return wrap_buffer(output_buffer, sizemat.clone(), stridemat.clone(), offsets); -} - -Tensor& dropout_nested_(Tensor& input, double p, bool train) { - Tensor input_buffer = get_buffer(input); - at::dropout_(input_buffer, p, train); - return input; + Tensor output = wrap_buffer(output_buffer, sizemat.clone(), stridemat.clone(), std::vector(offsets)), + mask = wrap_buffer(mask_buffer, sizemat.clone(), stridemat.clone(), std::vector(offsets)); + return std::make_tuple(output, mask); } Tensor softmax_nested( @@ -731,7 +733,7 @@ Tensor softmax_nested( auto input_ptr = get_nested_tensor_impl(input); int64_t ntensors = input_ptr->size(0); if (ntensors == 0) { - return input; + return input.clone(); } int64_t positive_dim = at::maybe_wrap_dim(dim, input_ptr->dim()); TORCH_CHECK( @@ -819,10 +821,18 @@ Tensor bmm_nested(const Tensor& self, const Tensor& mat2) { return output; } -// utilities support _NestedTensor_GeneralizedBMM +// utilities support `matmul_nested` namespace { +// Args: +// self_sizes: the sizes of `self` in `matmul_nested` +// mat2_sizes: the sizes of `mat2` in `matmul_nested` +// buffer_op: the options for new buffer +// sizemat_op: the options for new size matrix +// Returns: +// the batch size of each input underlying tensor, i.e. 
the product of batch-dimension sizes +// the empty output nested tensor inline std::tuple, Tensor> -_NestedTensor_GeneralizedBMM_BatchSizes_OutputMemory( +matmul_nested_helper( const std::vector& self_sizes, const std::vector& mat2_sizes, const c10::TensorOptions& buffer_op, @@ -869,14 +879,16 @@ _NestedTensor_GeneralizedBMM_BatchSizes_OutputMemory( } } -// This is a generalized batched matmul dedicated to nested tensors, +// Note [nested tensor matmul] +// This is really a generalized batched matmul dedicated to nested tensors, // where `self` and `mat2` have same number (>= 3) of dimensions. // The last 2 dimensions will be considered as matrix dimensions, // so they should be matrix-multiplicable. // The leading dimensions are considered as batch dimensions, // and since nested tensor does not support broadcasting for now, // for each batch dimension `self` and `mat2` must have same size. -Tensor _NestedTensor_GeneralizedBMM(const Tensor& self, const Tensor& mat2) { +// TODO: Should make full matmul semantics support some day +Tensor matmul_nested(const Tensor& self, const Tensor& mat2) { if (self.is_nested() && !mat2.is_nested()) { AT_ERROR("Expected both to be nested, but got a nested self and non-nested other"); } @@ -913,7 +925,7 @@ Tensor _NestedTensor_GeneralizedBMM(const Tensor& self, const Tensor& mat2) { // create a contiguous output std::vector batch_sizes; Tensor output; - std::tie(batch_sizes, output) = _NestedTensor_GeneralizedBMM_BatchSizes_OutputMemory( + std::tie(batch_sizes, output) = matmul_nested_helper( self_sizes, mat2_sizes, self_buffer.options(), self_ptr->get_nested_size_tensor().options()); // call tensor matmul // TODO: `padding nested tensor -> bmm -> remove padding` may be more efficient @@ -945,6 +957,28 @@ Tensor _NestedTensor_GeneralizedBMM(const Tensor& self, const Tensor& mat2) { return output; } +Tensor& matmul_out_nested(const Tensor& tensor1, const Tensor& tensor2, Tensor& result) { + // TODO: this is a very quick and dirty implementation + // should improve it to avoid the intermediate memory usage + Tensor function_result = at::matmul(tensor1, tensor2); + auto function_result_ptr = get_nested_tensor_impl(function_result); + // TODO: this is to reproduce function_result_ptr->opt_sizes_ + // if an accessor is provided in the future, can replace this + std::vector sizes; + for (int64_t i = 0; i < function_result_ptr->dim(); i++) { + c10::optional opt_size = function_result_ptr->opt_size(i); + if (opt_size.has_value()) { + sizes.push_back(*opt_size); + } + else { + sizes.push_back(-1); + } + } + result.reshape(sizes); + result.copy_(function_result); + return result; +} + Tensor transpose_nested(const Tensor& self, int64_t dim0, int64_t dim1) { auto self_ptr = get_nested_tensor_impl(self); // check input dimensions @@ -970,7 +1004,8 @@ Tensor transpose_nested(const Tensor& self, int64_t dim0, int64_t dim1) { // create transposed `sizemat` and `stridemat` Tensor sizemat_transposed = at::index_select(sizemat, 1, column_indices), stridemat_transposed = at::index_select(stridemat, 1, column_indices); - return wrap_buffer(self_ptr->get_buffer(), sizemat_transposed, stridemat_transposed, self_ptr->get_offsets()); + return create_nested_view_tensor( + self, sizemat_transposed, stridemat_transposed, std::vector(self_ptr->get_offsets())); } // utilities supporting `_reshape_nested` @@ -1005,24 +1040,18 @@ inline std::tuple NestedTensor_reshape_size_stride( // some negative sizes remain to be infered if (ndims_underlying < ndims_underlying_reshaped) { // replace 
negative sizes for old dimensions with old sizes - int64_t numel = 1, numel_reshaped = 1; for (int64_t idim = 0; idim < ndims_underlying; idim++) { int64_t& size_reshaped = size_reshaped_vector[idim]; TORCH_CHECK(size_reshaped >= -1, "invalid shape dimension ", size_reshaped); if (size_reshaped == -1) { size_reshaped = size[idim]; } - numel *= size[idim]; - numel_reshaped *= size_reshaped; } // infer negative size for new dimension int64_t infer_index = -1; for (int64_t idim = ndims_underlying; idim < ndims_underlying_reshaped; idim++) { const int64_t& size_reshaped = size_reshaped_vector[idim]; - if (size_reshaped >= 0) { - numel_reshaped *= size_reshaped; - } - else if (size_reshaped == -1) { + if (size_reshaped == -1) { if (infer_index > -1) { throw std::runtime_error("only one dimension can be inferred"); } @@ -1030,7 +1059,7 @@ inline std::tuple NestedTensor_reshape_size_stride( infer_index = idim; } } - else { + else if (size_reshaped < 0) { AT_ERROR("invalid shape dimension ", size_reshaped); } } @@ -1098,7 +1127,7 @@ inline void NestedTensor_reshape_copy( buffer.as_strided(sizes[i], strides[i], offsets[i]).reshape(sizes_reshaped[i])); } } -} +} // namespace // Special rules for reshape(nested tensor): // 1. Only 1 regular dimension can be collapsed with @@ -1142,7 +1171,7 @@ Tensor _reshape_nested(const Tensor& self, IntArrayRef proposed_shape) { std::tie(reshape_as_view, sizemat_reshaped, stridemat_reshaped) = NestedTensor_reshape_size_stride( sizes, strides, proposed_shape, sizemat.options()); if (reshape_as_view) { - return wrap_buffer(buffer, sizemat_reshaped, stridemat_reshaped, offsets); + return wrap_buffer(buffer, sizemat_reshaped, stridemat_reshaped, std::vector(offsets)); } Tensor buffer_reshaped = buffer.new_empty(buffer.sizes()); Tensor output = wrap_buffer(buffer_reshaped, sizemat_reshaped); diff --git a/aten/src/ATen/native/nested/NestedTensorMath.h b/aten/src/ATen/native/nested/NestedTensorMath.h index b315a3b253df3..844000605bb04 100644 --- a/aten/src/ATen/native/nested/NestedTensorMath.h +++ b/aten/src/ATen/native/nested/NestedTensorMath.h @@ -3,6 +3,9 @@ #include #include #include +#include +#include +#include #include @@ -21,11 +24,52 @@ inline at::Tensor wrap_buffer(at::Tensor buffer, at::Tensor nested_size_tensor) inline at::Tensor wrap_buffer( at::Tensor buffer, at::Tensor nested_size_tensor, - at::Tensor nested_stride_tensor, const std::vector& offsets) { + at::Tensor nested_stride_tensor, std::vector&& offsets) { TORCH_INTERNAL_ASSERT_DEBUG_ONLY(buffer.is_contiguous(), "Given buffer must be contiguous."); return at::detail::make_tensor( std::move(buffer), std::move(nested_size_tensor), - std::move(nested_stride_tensor), offsets); + std::move(nested_stride_tensor), std::move(offsets)); +} + +inline at::Tensor get_buffer(const at::Tensor& tensor) { + return get_nested_tensor_impl(tensor)->get_buffer(); +} + + /** + * Create a new nested tensor that is a view of a base nested tensor + * + * create_view_tensor calls a specialized constructor that copys the + * the keys from base onto the new view tensor being created. + * The storage is shared between the base and the returned view tensor + * + * All callers of this helper must: + * - Only return a view of the input + * - Must be explicit and define a derivative + * + * @param base Base tensor to construct view from. + * @param nested_size_tensor View tensors' sizes. + * @param nested_stride_tensor View tensors' strides. + * @param offsets View tensors' offsets. 
+ * @return A newly constructed view tensor + */ +inline at::Tensor create_nested_view_tensor( + const at::Tensor& base, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + std::vector&& offsets) { + TORCH_INTERNAL_ASSERT( + base.is_nested(), + "This function can only be used to create nested tensor views"); + TORCH_INTERNAL_ASSERT( + c10::impl::tls_local_dispatch_key_set().excluded_.has( + c10::DispatchKey::AutogradFunctionality), + "Creating a non differentiable nested tensor view in a CompositeImplicit function is not allowed."); + return at::detail::make_tensor( + c10::TensorImpl::VIEW, + base, + nested_size_tensor, + nested_stride_tensor, + std::move(offsets)); } // The sizes of the underlying tensors @@ -42,7 +86,8 @@ inline std::vector NestedTensor_get_sizes(const NestedTensorImpl* s return sizes; } const int64_t* sizemat_ptr = sizemat.data_ptr(); - for (int64_t i = 0; i < ntensors; i++) { + + for(const auto i: c10::irange(ntensors)){ sizes[i] = IntArrayRef(sizemat_ptr, sizemat_ptr + orig_dim); sizemat_ptr += orig_dim; } @@ -68,7 +113,7 @@ inline std::vector NestedTensor_get_strides(const NestedTensorImpl* return strides; } const int64_t* stridemat_ptr = stridemat.data_ptr(); - for (int64_t i = 0; i < ntensors; i++) { + for(const auto i: c10::irange(ntensors)) { strides[i] = IntArrayRef(stridemat_ptr, stridemat_ptr + orig_dim); stridemat_ptr += orig_dim; } diff --git a/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.cpp b/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.cpp index d33decc224333..c559b75a78a69 100644 --- a/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.cpp +++ b/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.cpp @@ -138,44 +138,56 @@ Tensor NestedTensor_add_NestedTensor_in_place( return self; } -void NestedTensor_softmax_dropout(const Tensor& query, Tensor& attn_scores) { +Tensor NestedTensor_softmax_dropout(const Tensor& self, const Tensor& query) { const auto* query_nt = get_nested_tensor_impl_or_null(query); TORCH_INTERNAL_ASSERT(query_nt != nullptr); TORCH_INTERNAL_ASSERT(nested_tensor_impl_is_contiguous(query_nt)); const Tensor& sizes = query_nt->get_nested_size_tensor(); const auto num_tensors = sizes.sizes()[0]; - const auto max_seq_len = attn_scores.sizes()[2]; + + auto output = at::empty_like(self,{}, at::MemoryFormat::Contiguous); + TORCH_INTERNAL_ASSERT(output.is_contiguous()); + + const auto max_seq_len = self.sizes()[2]; for (int64_t i = 0; i < num_tensors; i++) { auto seq_len = sizes.index({i, 0}).item(); - auto subseq = attn_scores.index( + auto subseq = self.index( {i, indexing::Slice(), indexing::Slice(0, seq_len), indexing::Slice(0, seq_len)}); auto subscores = at::softmax(subseq, subseq.dim() - 1); - attn_scores.index_put_( + output.index_put_( {i, indexing::Slice(), indexing::Slice(0, seq_len), indexing::Slice(0, seq_len)}, subscores); - attn_scores.index_put_( + output.index_put_( {i, indexing::Slice(), indexing::Slice(0, seq_len), indexing::Slice(seq_len, max_seq_len)}, 0); - attn_scores.index_put_( + output.index_put_( {i, indexing::Slice(), indexing::Slice(seq_len, max_seq_len), indexing::Slice(0, max_seq_len)}, 0); } + return output; } +Tensor NestedTensor_softmax_dropout_cuda(const Tensor& self, const Tensor& query) { + c10::optional attn_mask; + + attn_mask = NestedTensor_to_mask(query, 2, self.size(2)); + attn_mask = attn_mask->to(query.device(), /*non-blocking=*/true); + return _masked_softmax(self, *attn_mask, self.dim() - 1, /*mask type */ 1 ); // NestedTensor_to_mask 
produces a BxT mask +} Tensor NestedTensor_batch_offsets_from_size_tensor( const Tensor& sizes, @@ -196,6 +208,7 @@ Tensor NestedTensor_batch_offsets_from_size_tensor( return offsets; } + Tensor NestedTensor_to_mask(const Tensor& nt, c10::optional mask_dim, c10::optional mask_dim_length) { auto* nt_impl = get_nested_tensor_impl(nt); TORCH_CHECK( diff --git a/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.h b/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.h index 96ecfe91c3ddd..77eb0145d6847 100644 --- a/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.h +++ b/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.h @@ -50,8 +50,6 @@ Tensor NestedTensor_from_padded_tensor_cpu( const Tensor& padded, const NestedTensorImpl& nt); -void NestedTensor_softmax_dropout(const Tensor& query, Tensor& attn_scores); - Tensor NestedTensor_to_mask(const Tensor& nt, c10::optional mask_dim, c10::optional mask_dim_length); template diff --git a/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cpp b/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cpp index d89e5c5763d7f..fade1d026b2bc 100644 --- a/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cpp +++ b/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cpp @@ -204,5 +204,6 @@ Tensor NestedTensor_to_padded_tensor_cuda( } return NestedTensor_to_padded_tensor_generic(t, padding, output_size); } + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu b/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu index e8eb164bf4e76..dd5e9b80ca6bc 100644 --- a/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu +++ b/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu @@ -146,7 +146,7 @@ void remove_padding_kernelLauncher( dim3 grid; grid.x = batch_size; grid.y = GRID_DIM_Y; - at::cuda::CUDAStream stream = at::cuda::getDefaultCUDAStream(); + at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream(); if (output_dim == 2) { remove_padding_2<<>>( input, @@ -180,7 +180,7 @@ void remove_padding_transform0213_kernelLauncher( dim3 grid; grid.x = batch_size; grid.y = GRID_DIM_Y; - at::cuda::CUDAStream stream = at::cuda::getDefaultCUDAStream(); + at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream(); TORCH_CHECK( output_dim == 2, "remove padding transform0213 only support output dim == 2"); @@ -374,7 +374,7 @@ void add_padding_kernelLauncher( const std::vector& output_sizes, const int batch_size, const int output_batch_size) { - at::cuda::CUDAStream stream = at::cuda::getDefaultCUDAStream(); + at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream(); dim3 grid; grid.x = output_batch_size; grid.y = GRID_DIM_Y; diff --git a/aten/src/ATen/native/quantized/README.md b/aten/src/ATen/native/quantized/README.md index 62c4a8a1f9e13..f042881a8cebe 100644 --- a/aten/src/ATen/native/quantized/README.md +++ b/aten/src/ATen/native/quantized/README.md @@ -171,7 +171,8 @@ def quantized_xand(qa, qb): return ops.quantized.xand(qa, qb) ``` -**Note:** If writing new pytorch functions that use quantized kernels, it is strongly encouraged to place them in the `torch/nn/quantized/functional.py`. +**Note:** If writing new pytorch functions that use quantized kernels, +it is strongly encouraged to place them in the `torch/ao/nn/quantized/functional.py`. 
### C++ diff --git a/aten/src/ATen/native/quantized/cpu/QuantUtils.h b/aten/src/ATen/native/quantized/cpu/QuantUtils.h index 8ebcea45883c6..f53efab900be1 100644 --- a/aten/src/ATen/native/quantized/cpu/QuantUtils.h +++ b/aten/src/ATen/native/quantized/cpu/QuantUtils.h @@ -205,4 +205,24 @@ inline void HandleWeightsSaturation(int64_t N, float* weight) { } } +// Util function for quantizing bias. +inline at::Tensor QuantizeBias( + bool is_per_channel, + const at::Tensor& bias, + const at::Tensor& weight_contig, + double input_scale) { + at::Tensor qbias; + if (is_per_channel) { + auto bias_quant_scales = + weight_contig.q_per_channel_scales() * input_scale; + auto bias_zp = at::zeros(bias_quant_scales.sizes(), c10::kInt); + qbias = at::native::quantize_per_channel( + bias, bias_quant_scales, bias_zp, 0, c10::kQInt32); + } else { + qbias = at::native::quantize_per_tensor( + bias, weight_contig.q_scale() * input_scale, 0, c10::kQInt32); + } + return qbias; +} + } // namespace quant_utils diff --git a/aten/src/ATen/native/quantized/cpu/conv_serialization.h b/aten/src/ATen/native/quantized/cpu/conv_serialization.h index b44520f2eb0b7..9e4edb8f9a881 100644 --- a/aten/src/ATen/native/quantized/cpu/conv_serialization.h +++ b/aten/src/ATen/native/quantized/cpu/conv_serialization.h @@ -307,6 +307,9 @@ c10::intrusive_ptr> deserialize_conv( } for (const auto i : c10::irange(kSpatialDim)) { (void)i; // Suppress unused variable + TORCH_INTERNAL_ASSERT(idx < static_cast(config_vals.size()), + "Unexpected index = ", idx, " for config_vals of size ", + config_vals.size()); output_padding.emplace_back(config_vals.at(idx)); idx++; } diff --git a/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp b/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp index b1bdaadaf5b33..b7d8a89f43493 100644 --- a/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp +++ b/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp @@ -671,17 +671,14 @@ static void qprelu_out_kernel(Tensor& out, int64_t input_ndim = qx.dim(); TORCH_CHECK(input_ndim > 0, "qprelu: zero-dim input tensor is not allowed."); - // Helper to convert 1d tensors or scalar tensor to an nd tensor that broadcasts with input + // Weight should be a 1d or scalar tensor + // Reshape it to an nd tensor that broadcasts with input // All elements go into the channel dimension - DimVector sizes(input_ndim, 1), strides(input_ndim, 0); - auto as_nd = [&](const Tensor& t) { - TORCH_INTERNAL_ASSERT(t.defined() && (t.dim() == 1 || t.dim() == 0)); - sizes[1] = t.dim() == 1 ? t.sizes()[0] : 1; - strides[1] = t.dim() == 1 ? 
t.strides()[0] : 0; - return t.as_strided(sizes, strides); - }; - - auto qw_nd = as_nd(qw); + DimVector sizes(input_ndim, 1); + if (input_ndim > 1) { + sizes[1] = qw.numel(); + } + auto qw_nd = qw.reshape(sizes); auto iter = TensorIteratorConfig() .add_output(out) @@ -2750,18 +2747,26 @@ void quantized_normalize_kernel( dq = (dq - layer_mean_div_scale_xVec) * gamma_p_vec + beta_vec; - qVec::quantize(dqXVec, y_scale, y_zp, y_inv_scale) - .store(Y_ptr + vecStartIdx); } + qVec::quantize(dqXVec, y_scale, y_zp, y_inv_scale) + .store(Y_ptr + vecStartIdx); } - for (int64_t remIdx = chEndIdx - kNonVecRemInChannel; - remIdx < chEndIdx; - remIdx++) { - auto qXVal = X_ptr[remIdx]; - float dqXVal = at::native::dequantize_val(x_fake_scale, x_zp, qXVal); - float dqY = - (dqXVal - layer_mean_div_scale_x) * gamma_p + beta; - Y_ptr[remIdx] = at::native::quantize_val(y_scale, y_zp, dqY); + + // Remainder + if (kNonVecRemInChannel > 0) { + int64_t remIdx = chEndIdx - kNonVecRemInChannel; + auto qXVec = qVec::loadu(X_ptr + remIdx, kNonVecRemInChannel); + auto dqXVec = qXVec.dequantize(x_fake_scale_vec, x_zp_vec, + x_fake_scale_zp_neg_premul_vec); + int validDqvecLen = (kNonVecRemInChannel - 1) / fVec::size() + 1; + for (int i = 0; i < validDqvecLen; ++i) { + auto &dq = dqXVec[i]; + dq = + (dq - layer_mean_div_scale_xVec) * + gamma_p_vec + beta_vec; + } + qVec::quantize(dqXVec, y_scale, y_zp, y_inv_scale) + .store(Y_ptr + remIdx, kNonVecRemInChannel); } } // chIdx @@ -3703,8 +3708,8 @@ void quantize_tensor_per_channel_impl( // channels_last contig. // If axis = 0 and channels_last contig, implementation for channels // first (NCHW) works. - for (const auto b : c10::irange(batches)) { - for (const auto e : c10::irange(elements_per_channel)) { + for (C10_UNUSED const auto b : c10::irange(batches)) { + for (C10_UNUSED const auto e : c10::irange(elements_per_channel)) { uint32_t c = 0; while (c + 8 < channels) { const int32x4_t voffset0123 = vld1q_s32(&zero_points_int32t[c]); @@ -3738,7 +3743,7 @@ void quantize_tensor_per_channel_impl( } } } else { - for (const auto b : c10::irange(batches)) { + for (C10_UNUSED const auto b : c10::irange(batches)) { for (const auto c : c10::irange(channels)) { uint32_t e = 0; const int32x4_t voffset = vdupq_n_s32(zero_points_int32t[c]); diff --git a/aten/src/ATen/native/quantized/cpu/qconv.cpp b/aten/src/ATen/native/quantized/cpu/qconv.cpp index f31d271365e24..873d983a48209 100644 --- a/aten/src/ATen/native/quantized/cpu/qconv.cpp +++ b/aten/src/ATen/native/quantized/cpu/qconv.cpp @@ -725,17 +725,7 @@ at::Tensor PackedConvWeightsQnnp::apply_impl_xnnp( // Original bias was float, so we requantize it here. - at::Tensor qbias; - if (per_channel()) { - auto bias_quant_scales = - weight_contig.q_per_channel_scales() * act_input_scale; - auto bias_zp = at::zeros(bias_quant_scales.sizes(), c10::kInt); - qbias = at::native::quantize_per_channel( - bias, bias_quant_scales, bias_zp, 0, c10::kQInt32); - } else { - qbias = at::native::quantize_per_tensor( - bias, weight_contig.q_scale() * act_input_scale, 0, c10::kQInt32); - } + at::Tensor qbias = quant_utils::QuantizeBias(per_channel(), bias, weight_contig, act_input_scale); status = at::native::xnnp_utils::xnnp_create_convolution2d_nhwc( padding()[0], @@ -937,21 +927,8 @@ at::Tensor PackedConvWeightsQnnp::apply_impl( for (const auto i : c10::irange(wt_numel)) { qnnp_w_data[i] = static_cast(w_data[i] + 128); } - at::Tensor qbias; // Original bias was float, so we requantize it here. 
- if (convolution_op->per_channel) { - at::Tensor bias_quant_scales = - weight_contig.q_per_channel_scales() * act_input_scale; - at::Tensor bias_zp = at::zeros(bias_quant_scales.sizes(), c10::kInt); - qbias = at::native::quantize_per_channel( - bias_fp32, bias_quant_scales, bias_zp, 0, c10::kQInt32); - } else { - qbias = at::native::quantize_per_tensor( - bias_fp32, - weight_contig.q_scale() * act_input_scale, - 0, - c10::kQInt32); - } + at::Tensor qbias = quant_utils::QuantizeBias(convolution_op->per_channel, bias_fp32, weight_contig, act_input_scale); // Update the input scale to not pack again. input_scale = act_input_scale; diff --git a/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp b/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp index b32fcf03a8cc3..748e89fc182d7 100644 --- a/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp +++ b/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp @@ -139,7 +139,8 @@ namespace native { // // Python example examining a packed 8bit zero_point and scale: // -// >> x = torch.from_numpy(np.array([[[10, 20], [30, 40]],[[50, 60], [70, 80]]], dtype=np.float32)) +// >> x = torch.from_numpy(np.array([[[10, 20], [30, 40]],[[50, 60], [70, 80]]], +// dtype=np.float32)) // >> x_packed = torch.ops.quantized.embedding_bag_byte_prepack(x) // // # Pull out and examine packed scales, zero_points and values @@ -228,8 +229,9 @@ Tensor& qembeddingbag_byte_prepack_out(Tensor& output, const Tensor& weight) { auto* output_data = output.data_ptr(); #ifdef USE_FBGEMM - if (weight.scalar_type() == at::ScalarType::Half) { - const auto weight_data = static_cast(weight.data_ptr()); + if (weight_contig->scalar_type() == at::ScalarType::Half) { + const auto weight_data = + static_cast(weight_contig->data_ptr()); at::parallel_for( 0, embedding_rows, 1, [&](int64_t start_idx, int64_t end_idx) { fbgemm::FloatOrHalfToFused8BitRowwiseQuantizedSBFloat< @@ -240,7 +242,7 @@ Tensor& qembeddingbag_byte_prepack_out(Tensor& output, const Tensor& weight) { output_data + start_idx * output_columns); }); } else { - const auto weight_data = weight.data_ptr(); + const auto weight_data = weight_contig->data_ptr(); at::parallel_for( 0, embedding_rows, 1, [&](int64_t start_idx, int64_t end_idx) { fbgemm::FloatOrHalfToFused8BitRowwiseQuantizedSBFloat( @@ -346,8 +348,9 @@ Tensor _qembeddingbag_nbit_prepack_helper( #ifdef USE_FBGEMM if (!optimized_qparams) { - if (weight.scalar_type() == at::ScalarType::Half) { - const auto weight_data = static_cast(weight.data_ptr()); + if (weight_contig.scalar_type() == at::ScalarType::Half) { + const auto weight_data = + static_cast(weight_contig.data_ptr()); at::parallel_for( 0, embedding_rows, 1, [&](int64_t start_idx, int64_t end_idx) { fbgemm::FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf< @@ -359,7 +362,7 @@ Tensor _qembeddingbag_nbit_prepack_helper( output_data + start_idx * output_shape[1]); }); } else { - const auto weight_data = weight.data_ptr(); + const auto weight_data = weight_contig.data_ptr(); at::parallel_for( 0, embedding_rows, 1, [&](int64_t start_idx, int64_t end_idx) { fbgemm::FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf( diff --git a/aten/src/ATen/native/quantized/cpu/qlinear.cpp b/aten/src/ATen/native/quantized/cpu/qlinear.cpp index 99e0155857ceb..0e51b98676078 100644 --- a/aten/src/ATen/native/quantized/cpu/qlinear.cpp +++ b/aten/src/ATen/native/quantized/cpu/qlinear.cpp @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -328,8 +329,7 @@ at::Tensor 
PackedLinearWeightsQnnp::apply_impl_xnnp( orig_weight, xnnp_weight); // Original bias was float, so we requantize it here. - at::Tensor qbias = at::native::quantize_per_tensor( - bias_, orig_weight.q_scale() * input_scale, 0, c10::kQInt32); + at::Tensor qbias = quant_utils::QuantizeBias(false, bias_, orig_weight, input_scale); // output limits auto output_min = kReluFused @@ -476,18 +476,7 @@ at::Tensor PackedLinearWeightsQnnp::apply_impl( } // Original bias was float, so we requantize it here. const bool is_per_channel = orig_weight.qscheme() == at::kPerChannelAffine; - at::Tensor qbias; - // Original bias was float, so we requantize it here. - if (is_per_channel) { - at::Tensor bias_quant_scales = - weight_contig.q_per_channel_scales() * input_scale; - at::Tensor bias_zp = at::zeros(bias_quant_scales.sizes(), c10::kInt); - qbias = at::native::quantize_per_channel( - bias_fp32, bias_quant_scales, bias_zp, 0, c10::kQInt32); - } else { - qbias = at::native::quantize_per_tensor( - bias_fp32, weight_contig.q_scale() * input_scale, 0, c10::kQInt32); - } + at::Tensor qbias = quant_utils::QuantizeBias(is_per_channel, bias_fp32, weight_contig, input_scale); // Update the input scale to not pack again. this->input_scale = input_scale; diff --git a/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp b/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp index df529a6612f98..d6fd9be57e30e 100644 --- a/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp +++ b/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp @@ -415,14 +415,21 @@ at::Tensor& PackedLinearWeightFp16::apply_dynamic_impl( // Resize output Tensor output.resize_(output_sizes); - // Call the fp16 gemm interface - fbgemm::cblas_gemm_compute( - fbgemm::matrix_op_t::NoTranspose, - M, - input_ptr, - packed_weight_fp16, - 0.0f, - output.data_ptr()); + int num_tasks = at::get_num_threads(); + at::parallel_for(0, num_tasks, 1, [&](int64_t begin, int64_t end) { + for (const auto task_id : c10::irange(begin, end)) { + // Call the fp16 gemm interface + fbgemm::cblas_gemm_compute( + /*transa=*/fbgemm::matrix_op_t::NoTranspose, + /*m=*/static_cast(M), + /*A=*/input_ptr, + /*Bp=*/packed_weight_fp16, + /*beta=*/0.0f, + /*C=*/output.data_ptr(), + /*thread_id=*/static_cast(task_id), + /*num_threads=*/num_tasks); + } + }); // Add bias term if (bias_.has_value()) { diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/test/fully-connected-sparse-operator-tester.h b/aten/src/ATen/native/quantized/cpu/qnnpack/test/fully-connected-sparse-operator-tester.h index 6235d55f8bc7c..575c0a17bceb1 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/test/fully-connected-sparse-operator-tester.h +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/test/fully-connected-sparse-operator-tester.h @@ -577,7 +577,7 @@ class FullyConnectedSparseOperatorTester { for (size_t i = 0; i < batchSize(); i++) { for (size_t c = 0; c < outputChannels(); c++) { - ASSERT_EQ( + ASSERT_FLOAT_EQ( output_dynamic[i * outputChannels() + c], accumulators_float[i * outputChannels() + c]) << "at " << i << ", " << c diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h b/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h index 71a370f85d4f2..25e7bb670653d 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h @@ -475,7 +475,7 @@ class GemmBlockSparseMicrokernelTester { 
for (size_t mIndex = 0; mIndex < m(); mIndex++) { for (size_t nIndex = 0; nIndex < n(); nIndex++) { - ASSERT_EQ( + ASSERT_FLOAT_EQ( c[mIndex * cStride() + nIndex], acc[mIndex * n() + nIndex]) << "at " << mIndex << ", " << nIndex diff --git a/aten/src/ATen/native/sparse/SparseCsrTensor.cpp b/aten/src/ATen/native/sparse/SparseCsrTensor.cpp index 8d3a17a24ff82..062cc3d126293 100644 --- a/aten/src/ATen/native/sparse/SparseCsrTensor.cpp +++ b/aten/src/ATen/native/sparse/SparseCsrTensor.cpp @@ -757,9 +757,8 @@ Tensor empty_like_sparse_csr( } Tensor select_sparse_csr(const Tensor& self, int64_t dim, int64_t index) { - TORCH_CHECK( - self.layout() == kSparseCsr || self.layout() == kSparseBsr, - "select(): currently only supports the SparseCsr and SparseBsr layout."); + AT_DISPATCH_ALL_SPARSE_COMPRESSED_LAYOUTS( + self.layout(), "select()", []() { return; }); TORCH_CHECK_INDEX( self.dim() != 0, "select() cannot be applied to a 0-dim tensor."); dim = maybe_wrap_dim(dim, self.dim()); @@ -784,41 +783,55 @@ Tensor select_sparse_csr(const Tensor& self, int64_t dim, int64_t index) { new_sizes.erase(new_sizes.begin() + dim); auto options = self.options(); - // Selecting batch dimension - if (dim < self.dim() - 2) { - if (self.layout() == kSparseBsr) { - return at::native::_sparse_bsr_tensor_unsafe( - self.crow_indices().select(dim, index), - self.col_indices().select(dim, index), - self.values().select(dim, index), - new_sizes, - optTypeMetaToScalarType(options.dtype_opt()), - options.layout_opt(), - options.device_opt(), - options.pinned_memory_opt()); - } - return at::native::_sparse_csr_tensor_unsafe( - self.crow_indices().select(dim, index), - self.col_indices().select(dim, index), + Tensor plain_indices; + Tensor compressed_indices; + std::tie(compressed_indices, plain_indices) = + AT_DISPATCH_ROW_SPARSE_COMPRESSED_LAYOUTS( + self.layout(), + "select", + [&]() { + return std::make_pair(self.crow_indices(), self.col_indices()); + }, + [&]() { + return std::make_pair(self.ccol_indices(), self.row_indices()); + }); + auto n_batch = compressed_indices.dim() - 1; + + if (dim < n_batch) { + // Selecting batch dimension + return at::native::_sparse_compressed_tensor_unsafe( + compressed_indices.select(dim, index), + plain_indices.select(dim, index), self.values().select(dim, index), new_sizes, optTypeMetaToScalarType(options.dtype_opt()), options.layout_opt(), options.device_opt(), options.pinned_memory_opt()); - } else { + } else if (dim < n_batch + 2) { + // Selecting sparse dimension TORCH_CHECK( - self.is_sparse_csr(), - "select(): selecting non-batch dimensions is currently only supported for CSR tensors."); + self.layout() == kSparseCsr || self.layout() == kSparseCsc, + "select(): selecting non-batch dimensions is currently only supported for non-blocked sparse compressed layouts tensors."); TORCH_CHECK( - self.dim() == 2, - "select(): selecting rows or columns is not implemented for batched sparse CSR tensors.") - // Converting to COO and calling select is slighly slower than operating on - // the CSR indices directly for constructing a COO vector, however current - // version is more readable and easier to understand. + n_batch == 0, + "select(): selecting rows or columns is not implemented for batched sparse compressed tensors.") + // Converting to COO and calling select is slightly slower than operating + // on the CSR indices directly for constructing a COO vector, however + // current version is more readable and easier to understand. 
return self.to_sparse().select(dim, index); + } else { + // Selecting dense dimension + return AT_DISPATCH_PLAIN_SPARSE_COMPRESSED_LAYOUTS( + self.layout(), + "select", + // Non blocked layout (2 sparse dims become 1 nnz dim in values, so dim + // is found one position to the left) + [&]() { return self.values().select(dim - 1, index); }, + // Block layout (2 sparse dims become 1 nnz dim + 2 block-shape dims in + // values, so dim is found 1 position to the right) + [&]() { return self.values().select(dim + 1, index); }); } } - } // namespace native } // namespace at diff --git a/aten/src/ATen/native/sparse/SparseTensorMath.cpp b/aten/src/ATen/native/sparse/SparseTensorMath.cpp index ad98fcee2d5bb..a25083c6fae88 100644 --- a/aten/src/ATen/native/sparse/SparseTensorMath.cpp +++ b/aten/src/ATen/native/sparse/SparseTensorMath.cpp @@ -610,34 +610,127 @@ SparseTensor& add_out_sparse_cpu(const SparseTensor& t, const SparseTensor& src, // add(Tensor, SparseTensor, Scalar) // formerly known as spcadd // -------------------------------------------------------------------- - template -void add_dense_sparse_worker_cpu(Tensor& r, const Scalar& value, const SparseTensor& sparse, const Tensor& indices, const Tensor& values) { +void add_dense_sparse_worker_non_hybrid_cpu(Tensor& r, const Scalar& value, const SparseTensor& sparse, const Tensor& indices, const Tensor& values) { auto indices_accessor = indices.accessor(); auto values_accessor = values.accessor(); scalar_t* r_ptr = r.data_ptr(); - auto r_strides = r.strides(); scalar_t cast_value = value.to(); - const auto sparse_dim = sparse.sparse_dim(); - + const int64_t sparse_dim = sparse.sparse_dim(); + std::vector result_stride(sparse_dim); + for (const auto d: c10::irange(sparse_dim)) { + result_stride[d] = r.stride(d); + } at::parallel_for(0, sparse._nnz(), 0, [&](int64_t start, int64_t end) { - for (auto k: c10::irange(start, end)) { + for (const auto k: c10::irange(start, end)) { int64_t index = r.storage_offset(); for (auto d: c10::irange(sparse_dim)) { - index += r_strides[d] * indices_accessor[d][k]; + index += result_stride[d] * indices_accessor[d][k]; } r_ptr[index] += cast_value * values_accessor[k]; } }); } +template +inline void add_dense_sparse_worker_hybrid_cpu(Tensor& r, const Scalar& value, const SparseTensor& sparse, const Tensor& indices, const Tensor& values) { + + // Get the dense dimension element numbers of hybrid sparse tensor + int64_t values_dense_size = values.stride(0); + TORCH_CHECK(values.is_contiguous()); + scalar_t* v_ptr = values.data_ptr(); + + scalar_t* r_ptr = r.data_ptr(); + TORCH_CHECK(r_ptr != nullptr); + + auto indices_accessor = indices.accessor(); + scalar_t cast_value = value.to(); + auto sparse_dim = sparse.sparse_dim(); + std::vector result_stride(sparse_dim); + for (auto d : c10::irange(sparse_dim)) { + result_stride[d] = r.stride(d); + } + + at::parallel_for(0, sparse._nnz(), 0, [&](int64_t start, int64_t end) { + for (auto k: c10::irange(start, end)) { + auto r_index = r_ptr; + for (auto d: c10::irange(sparse_dim)) { + r_index += result_stride[d] * indices_accessor[d][k]; + } + auto v_index = v_ptr + k * values_dense_size; + at::native::cpublas::axpy(values_dense_size, cast_value, v_index, 1, r_index, 1); + } + }); +} + +template +inline void add_dense_sparse_worker_non_coalesced_cpu(Tensor& r, const Scalar& value, + const SparseTensor& sparse, const Tensor& indices, const Tensor& values) { + + // Get the dense dimension element numbers of hybrid sparse tensor + auto values_dense_size = values.stride(0); 
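// Editor's note (illustration, not part of the patch): values for a hybrid
// sparse tensor has shape [nnz, d1, ..., dk] (k may be 0). Because it is checked
// to be contiguous immediately below, values.stride(0) equals d1 * ... * dk,
// i.e. the number of elements in one dense slice, which is why stride(0) is
// reused both as values_dense_size and as the length passed to cpublas::axpy
// when each slice is accumulated into the result.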
+ TORCH_CHECK(values.is_contiguous()); + scalar_t* v_ptr = values.data_ptr(); + TORCH_CHECK(v_ptr != nullptr); + + scalar_t* r_ptr = r.data_ptr(); + TORCH_CHECK(r_ptr != nullptr); + + scalar_t cast_value = value.to(); + auto sparse_dim = sparse.sparse_dim(); + + auto indices_accessor = indices.accessor(); + int64_t result_length = r.size(0); + std::vector result_stride(sparse_dim); + for (auto d : c10::irange(sparse_dim)) { + result_stride[d] = r.stride(d); + } + + auto sparse_nnz = sparse._nnz(); + int max_threads = at::get_num_threads(); + max_threads = (result_length < max_threads) ? result_length : max_threads; + int64_t avg_chunk_down = result_length / max_threads; + std::vector chuck_size(max_threads); + for (const auto i : c10::irange(max_threads)) { + chuck_size[i] = avg_chunk_down; + } + //make chunk balance among threads as 211 + for (auto i = 0 ; i < result_length % max_threads ; i++) { + chuck_size[i] += 1; + } + std::vector chuck_sum_size(max_threads + 1); + chuck_sum_size[0] = 0; + for (const auto i : c10::irange(1, max_threads)) { + chuck_sum_size[i] = chuck_sum_size[i - 1] + chuck_size[i - 1]; + } + chuck_sum_size[max_threads] = result_length; + at::parallel_for(0, max_threads, 0, [&](int64_t start, int64_t end) { + for (auto k: c10::irange(start, end)) { + int64_t chunk_begin = chuck_sum_size[k]; + int64_t chunk_end = chuck_sum_size[k + 1]; + for (const auto n: c10::irange(sparse_nnz)) { + int64_t chunk_offset = indices_accessor[0][n]; + if (chunk_offset >= chunk_begin && chunk_offset < chunk_end) { + int64_t r_offset = result_stride[0] * chunk_offset; + for (const auto d : c10::irange(1, sparse_dim)) { + r_offset += result_stride[d] * indices_accessor[d][n]; + } + scalar_t* v_index = v_ptr + n * values_dense_size; + auto r_index = r_ptr + r_offset; + at::native::cpublas::axpy(values_dense_size, cast_value, v_index, 1, r_index, 1); + } + } + } + }); +} + Tensor& add_out_dense_sparse_cpu(Tensor& r, const Tensor& dense, const SparseTensor& sparse_, const Scalar& value) { - AT_ASSERT(!r.is_sparse()); - AT_ASSERT(!dense.is_sparse()); - AT_ASSERT(sparse_.is_sparse()); + TORCH_CHECK(!r.is_sparse()); + TORCH_CHECK(!dense.is_sparse()); + TORCH_CHECK(sparse_.is_sparse()); - AT_ASSERT(!dense.is_cuda()); // dispatch argument + TORCH_CHECK(!dense.is_cuda()); // dispatch argument TORCH_CHECK(!r.is_cuda(), "add: expected 'out' to be CPU tensor, but got CUDA tensor"); TORCH_CHECK(!sparse_.is_cuda(), "add: expected 'other' to be a CPU tensor, but got a CUDA tensor"); @@ -648,19 +741,15 @@ Tensor& add_out_dense_sparse_cpu(Tensor& r, const Tensor& dense, const SparseTen TORCH_CHECK(canCast(commonDtype, r.scalar_type()), "Can't convert result type ", commonDtype, " to output ", r.scalar_type(), " in add operation"); r.resize_as_(dense); - SparseTensor sparse = sparse_.coalesce(); - - Tensor indices = sparse._indices(); - Tensor values = sparse._values(); - int64_t nDim = dense.dim(); - int64_t nDimI = sparse.sparse_dim(); - if (sparse._nnz() == 0) { + auto sparse_nnz = sparse_._nnz(); + if (sparse_nnz == 0) { if (!is_same_tensor(r, dense)) r.copy_(dense); return r; } - Tensor valuesBuffer = values.to(commonDtype); + int64_t dense_dim = dense.dim(); + int64_t sparse_dim = sparse_.sparse_dim(); Tensor resultBuffer = r; if (r.scalar_type() != commonDtype) { resultBuffer = dense.to(commonDtype); @@ -668,23 +757,56 @@ Tensor& add_out_dense_sparse_cpu(Tensor& r, const Tensor& dense, const SparseTen resultBuffer.copy_(dense); } - // accessors rely on nnz test - if (nDim > nDimI) { - auto 
indices_accessor = indices.accessor(); - for (const auto k : c10::irange(sparse._nnz())) { - Tensor dstBuffer = resultBuffer; - for (const auto d : c10::irange(sparse.sparse_dim())) { - dstBuffer = dstBuffer.select(0, indices_accessor[d][k]); - } - Tensor srcBuffer = valuesBuffer.select(0, k); - dstBuffer.add_(srcBuffer, value); + Tensor values = sparse_._values(); + bool sparse_is_coalesced = (sparse_.is_coalesced() || sparse_nnz == 1); + bool result_is_contiguous = ((r.storage().data() != nullptr) && resultBuffer.is_contiguous()); + bool value_is_contiguous = values.is_contiguous(); + bool is_contiguous = (result_is_contiguous && value_is_contiguous); + + SparseTensor sparse = sparse_; + Tensor indices = sparse_._indices(); + Tensor valuesBuffer = values.to(commonDtype); + if (is_contiguous && sparse_is_coalesced) { + //TODO: we can optimize it for non-hybrid by not using buffers + if (sparse_dim == dense_dim) { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( + at::ScalarType::ComplexHalf, at::ScalarType::Bool, at::ScalarType::BFloat16, at::ScalarType::Half, + commonDtype, "add_dense_sparse_non_hybrid", [&] { + add_dense_sparse_worker_non_hybrid_cpu(resultBuffer, value, sparse_, indices, valuesBuffer); + }); + } else { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( + at::ScalarType::ComplexHalf, at::ScalarType::Bool, at::ScalarType::BFloat16, at::ScalarType::Half, + commonDtype, "add_dense_sparse_hybrid", [&] { + add_dense_sparse_worker_hybrid_cpu(resultBuffer, value, sparse_, indices, valuesBuffer); + }); } - } else { + } else if (is_contiguous && (sparse_dim > 0)) { + // Handle sparse is not coalesced AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( at::ScalarType::ComplexHalf, at::ScalarType::Bool, at::ScalarType::BFloat16, at::ScalarType::Half, - commonDtype, "add_dense_sparse", [&] { - add_dense_sparse_worker_cpu(resultBuffer, value, sparse, indices, valuesBuffer); + commonDtype, "add_dense_sparse_worker_non_coalesced", [&] { + add_dense_sparse_worker_non_coalesced_cpu(resultBuffer, value, sparse_, indices, valuesBuffer); }); + } else { + // Slow path for non-contiguous values and output + // TODO: coalesce() performance may can be further improved + sparse = sparse_.coalesce(); + indices = sparse._indices(); + values = sparse._values(); + valuesBuffer = values.to(commonDtype); + auto indices_accessor = indices.accessor(); + auto sparse_nnz = sparse._nnz(); + at::parallel_for(0, sparse_nnz, 100, [&](int64_t start, int64_t end) { + for (auto k: c10::irange(start, end)) { + Tensor dstBuffer = resultBuffer; + for (auto d: c10::irange(sparse_dim)) { + dstBuffer = dstBuffer.select(0, indices_accessor[d][k]); + } + Tensor srcBuffer = valuesBuffer.select(0, k); + dstBuffer.add_(srcBuffer, value); + } + }); } if (r.scalar_type() != commonDtype) { r.copy_(resultBuffer); @@ -776,7 +898,7 @@ Tensor& intersection_binary_op_sparse_dense_out( const auto sparse_dim = static_cast(res_shape.size()); const auto indices = at::empty({sparse_dim, 0}, s_._indices().options()); const auto values = at::empty({0}, s_._values().options().dtype(res.scalar_type())); - get_sparse_impl(res)->raw_resize_(sparse_dim, /*dense_dim=*/0, /*shape=*/res_shape); + get_sparse_impl(res)->raw_resize_(sparse_dim, /*dense_dim=*/0, /*size=*/res_shape); get_sparse_impl(res)->set_indices_and_values_unsafe(indices, values); get_sparse_impl(res)->set_nnz_and_narrow(0); return res._coalesced_(true); @@ -798,7 +920,18 @@ Tensor& intersection_binary_op_sparse_dense_out( const auto apply_op = [&](const Tensor& d_filtered) -> Tensor& { const auto 
res_indices = s_indices.clone(); - const auto res_values = op(d_filtered, s_values); + // to(res.scalar_type) is only performed when both d and s are 0-dim. + // This insures right type promotions with the following rules: + // op(0-dim, 0-dim).dtype == + // op(0-dim, ge-1-dim).dtype == .dtype, + // where ge-1-dim is a tensor with dim >= 1. + // We do not cast if op is performed in-place. + // The cast is required if s is 0-dim non-coalesced tensor and d is 0-dim. + // This is because s.values is at least 1D, so + // op(s.values, d).dtype == s.values.dtype, but we want + // op(s.values, d).dtype == . + const auto values = op(d_filtered, s_values); + const auto res_values = is_same_tensor(s_, res) ? values : values.to(res.scalar_type()); get_sparse_impl(res)->raw_resize_(sparse_dim, dense_dim, res_shape); get_sparse_impl(res)->set_indices_and_values_unsafe(res_indices, res_values); get_sparse_impl(res)->set_nnz_and_narrow(s._nnz()); @@ -827,14 +960,14 @@ Tensor& intersection_binary_op_sparse_dense_out( intersec_indices.reserve(d_dim); if (d_start_dim_intersec) { - intersec_indices.push_back(Ellipsis); + intersec_indices.emplace_back(Ellipsis); } for (const auto i : c10::irange(sparse_dim_intersec)) { const auto s_idx = s_start_dim_intersec + i; - intersec_indices.push_back(s_indices[s_idx]); + intersec_indices.emplace_back(s_indices[s_idx]); } for (auto i = d_start_dim_intersec + sparse_dim_intersec; i < d_dim; ++i) { - intersec_indices.push_back(Slice()); + intersec_indices.emplace_back(Slice()); } // we need to expand d in the dimensions it is being indexed into // to avoid out of bound indices @@ -851,10 +984,10 @@ Tensor& intersection_binary_op_sparse_dense_out( // Otherwise nnz gets larger, and both indices and values need an update. const auto d_batch_shape = d.sizes().slice(0, d_start_dim_intersec); - const auto d_batch_len = d_batch_shape.size(); - int64_t batch_count; - int64_t max_batch_dim; - std::tie(batch_count, max_batch_dim) = [&]() -> std::tuple { + const auto d_batch_len = static_cast(d_batch_shape.size()); + int64_t batch_count = 1; + int64_t max_batch_dim = 0; + std::tie(batch_count, max_batch_dim) = [d_batch_shape]() -> std::tuple { int64_t batch_count = 1; int64_t max_batch_dim = 0; for (const auto& b : d_batch_shape) { @@ -873,31 +1006,31 @@ Tensor& intersection_binary_op_sparse_dense_out( const auto res_values = op(d_filtered, s_values).reshape(res_values_shape); const auto res_indices = [&]() -> Tensor { const auto index_buffer = at::arange(max_batch_dim, s_indices.options()); - auto res_indices = at::empty({res_sparse_dim, res_nnz}, s_indices.options()); + auto indices = at::empty({res_sparse_dim, res_nnz}, s_indices.options()); // fill in indices corresponding to the "batch" dimensions of d. int64_t n_repeat_interleave = res_nnz; - int n_repeat = 1; + int64_t n_repeat = 1; for (const auto dim : c10::irange(d_batch_len)) { const auto dim_size = d_batch_shape[dim]; n_repeat_interleave /= dim_size; // fill in indices corresponding to the "batch" dimension dim. 
- // Equivalent to res_indices[dim].copy_(repeat_interleave(dim_index, n_repeat_interleave).repeat(n_repeat)) + // Equivalent to indices[dim].copy_(repeat_interleave(dim_index, n_repeat_interleave).repeat(n_repeat)) const std::initializer_list dim_index_expanded_shape = {n_repeat, dim_size, n_repeat_interleave}; const auto dim_index = index_buffer.slice(-1, 0, dim_size); const auto dim_index_expanded = dim_index.unsqueeze(0).unsqueeze_(-1).expand(dim_index_expanded_shape); - // NOTE: res_indices is contiguous, so view is safe - res_indices[dim].view(dim_index_expanded_shape).copy_(dim_index_expanded); + // NOTE: indices is contiguous, so view is safe + indices[dim].view(dim_index_expanded_shape).copy_(dim_index_expanded); n_repeat *= dim_size; } // fill in indices corresponding to s_indices. - // Equivalent to res_indices_sparse.copy(s_indices.repeat({1, n_repeat}) + // Equivalent to indices_sparse.copy(s_indices.repeat({1, n_repeat}) n_repeat = res_nnz / s_nnz; - auto res_indices_sparse = res_indices.narrow(0, d_batch_len, res_sparse_dim - d_batch_len); + auto indices_sparse = indices.narrow(0, d_batch_len, res_sparse_dim - d_batch_len); const std::initializer_list s_indices_expanded_shape = {-1, n_repeat, s_nnz}; const auto s_indices_expanded = s_indices.unsqueeze(1).expand(s_indices_expanded_shape); - res_indices_sparse.view(s_indices_expanded_shape).copy_(s_indices_expanded); + indices_sparse.view(s_indices_expanded_shape).copy_(s_indices_expanded); - return res_indices; + return indices; }(); get_sparse_impl(res)->raw_resize_(res_sparse_dim, res_dense_dim, res_shape); @@ -914,6 +1047,46 @@ Tensor& _mul_dense_sparse_out(const Tensor& d, const Tensor& s, Tensor& res) { }); } +Tensor& _mul_sparse_sparse_zero_dim_out(const Tensor& zero_dim, const Tensor& other, Tensor& r) { + const auto is_wrapped_scalar = [](const Tensor& s) -> bool { + return !s.dim() && s.is_coalesced(); + }; + + const auto extract_vals_from_wrapped_scalar = [](const Tensor& s) -> Tensor { + auto vals = s._values().squeeze(0); + // if squeeze does not kill the dim, it means that + // vals is empty with shape [0]. In such a case we + // return a 0-dim empty tensor to avoid broadcasting + // issues in intersection_binary_op_sparse_dense_out + // when the sparse argument is actually 0-dim. + if (vals.dim()) { + return at::empty({}, vals.options()); + } + return vals; + }; + + // The code dispatches to mul(dense, sparse), and the goal + // is to delay calling into coalesce when converting one of + // the sparse arguments to dense if possible. + // This is possible when there is a 0-dim coalesced argument. + + // if is_wrapped_scalar(zero_dim) + if (zero_dim.is_coalesced()) { + const auto scalar_val = extract_vals_from_wrapped_scalar(zero_dim); + return _mul_dense_sparse_out(scalar_val, other, r); + } + // Here zero_dim is not a wrapped scalar, so we test other. + if (is_wrapped_scalar(other)) { + const auto scalar_val = extract_vals_from_wrapped_scalar(other); + return _mul_dense_sparse_out(scalar_val, zero_dim, r); + } + // Neither of inputs is a wrapped scalar, but zero_dim + // is at least 0-dim, so we coalesce it to convert to + // a scalar. 
+ const auto scalar_val = extract_vals_from_wrapped_scalar(zero_dim.coalesce()); + return _mul_dense_sparse_out(scalar_val, other, r); +} + SparseTensor& mul_out_sparse_cpu(const Tensor& t_, const Tensor& src_, Tensor& r) { AT_ASSERT(!t_.is_cuda()); // dispatch argument TORCH_CHECK(!r.is_cuda(), "mul: expected 'out' to be CPU tensor, but got CUDA tensor"); @@ -928,6 +1101,14 @@ SparseTensor& mul_out_sparse_cpu(const Tensor& t_, const Tensor& src_, Tensor& r return _mul_dense_sparse_out(t_, src_, r); } + // case mul(sparse, sparse) with a 0-dim input. + if (!src_.dim()) { + return _mul_sparse_sparse_zero_dim_out(src_, t_, r); + } + if (!t_.dim()) { + return _mul_sparse_sparse_zero_dim_out(t_, src_, r); + } + TORCH_CHECK(t_.sizes().equals(src_.sizes()), "mul: expected 'self' and 'other' to have same sizes when both are sparse" ", but ", t_.sizes(), " != ", src_.sizes()); diff --git a/aten/src/ATen/native/sparse/SparseTensorMath.h b/aten/src/ATen/native/sparse/SparseTensorMath.h index 645e0e65e0605..1a263b2e7d5e7 100644 --- a/aten/src/ATen/native/sparse/SparseTensorMath.h +++ b/aten/src/ATen/native/sparse/SparseTensorMath.h @@ -7,5 +7,6 @@ namespace at { namespace native { TORCH_API sparse::SparseTensor& mul_out_sparse_scalar(sparse::SparseTensor& r, const sparse::SparseTensor& t, const Scalar& value); TORCH_API sparse::SparseTensor& mul_out_sparse_zerodim(sparse::SparseTensor& r, const sparse::SparseTensor& t, const Tensor& value); TORCH_API sparse::SparseTensor& _mul_dense_sparse_out(const Tensor& d, const Tensor& s, Tensor& res); +TORCH_API sparse::SparseTensor& _mul_sparse_sparse_zero_dim_out(const Tensor& zero_dim, const Tensor& other, Tensor& res); }} diff --git a/aten/src/ATen/native/sparse/cuda/SoftMax.cu b/aten/src/ATen/native/sparse/cuda/SoftMax.cu index 05cb9e06d90f3..0591646f89b5a 100644 --- a/aten/src/ATen/native/sparse/cuda/SoftMax.cu +++ b/aten/src/ATen/native/sparse/cuda/SoftMax.cu @@ -258,7 +258,7 @@ Tensor get_offsets( cudaMemcpyHostToDevice, stream)); - auto indices_accessor = indices.packed_accessor(); + auto indices_accessor = indices.packed_accessor64(); Tensor offsets = at::empty({nnz}, indices.options()); @@ -345,7 +345,7 @@ std::tuple compute_pool_max( if (requireMxRows) { auto values_accessor = - values.packed_accessor(); // {nnz, nvalues} + values.packed_accessor64(); // {nnz, nvalues} mx_buffer = at::full({new_sz * nvalues}, Scalar(-std::numeric_limits::infinity()), values.options()); @@ -420,10 +420,10 @@ void cuda_sparse_coo_softmax( /* Prepare accessors */ auto values_2 = values.view({nnz, nvalues}); - auto values_accessor = values_2.packed_accessor(); + auto values_accessor = values_2.packed_accessor64(); auto out_values_2 = out_values.view({nnz, nvalues}); - auto out_values_accessor = out_values_2.packed_accessor(); + auto out_values_accessor = out_values_2.packed_accessor64(); Tensor sorted_indices; Tensor pool_offsets; @@ -539,13 +539,13 @@ void cuda_sparse_coo_softmax_backward( auto nvalues = get_nvalues(sizes, sparse_dim); auto values_2 = values.view({nnz, nvalues}); - auto values_accessor = values_2.packed_accessor(); + auto values_accessor = values_2.packed_accessor64(); auto out_values_2 = out_values.view({out_nnz, nvalues}); - auto out_values_accessor = out_values_2.packed_accessor(); + auto out_values_accessor = out_values_2.packed_accessor64(); auto grad_values_2 = grad_values.view({grad_nnz, nvalues}); - auto grad_values_accessor = grad_values_2.packed_accessor(); + auto grad_values_accessor = grad_values_2.packed_accessor64(); Tensor 
lower_bound_values = at::empty({out_offsets.size(0)}, indices.options()); diff --git a/aten/src/ATen/native/sparse/cuda/SparseBlasImpl.cpp b/aten/src/ATen/native/sparse/cuda/SparseBlasImpl.cpp index 4309e756e8bea..bae31b308cbfe 100644 --- a/aten/src/ATen/native/sparse/cuda/SparseBlasImpl.cpp +++ b/aten/src/ATen/native/sparse/cuda/SparseBlasImpl.cpp @@ -562,7 +562,7 @@ void spmm( const Scalar& beta, const Scalar& alpha, const Tensor& result) { -#if !AT_USE_CUSPARSE_GENERIC_API() +#if !(AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_52_API()) addmm_out_legacy(mat1, mat2, beta, alpha, result); #else c10::MaybeOwned result_ = prepare_dense_matrix_for_cusparse(result); @@ -663,7 +663,7 @@ void spmm( if (!result.is_same(*result_)) { result.copy_(*result_); } -#endif // !AT_USE_CUSPARSE_GENERIC_API() +#endif // !(AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API()) } void spgemm( @@ -672,12 +672,18 @@ void spgemm( const Scalar& beta, const Scalar& alpha, const at::sparse_csr::SparseCsrTensor& C) { -#if defined(CUDA_VERSION) && CUDA_VERSION < 11000 +#if (!defined(USE_ROCM)) && (defined(CUDA_VERSION) && CUDA_VERSION < 11000) TORCH_CHECK( false, "Calling addmm with sparse GPU tensors requires compiling ", "PyTorch with CUDA 11+. ", "Please use PyTorch built with newer CUDA version."); +#elif defined(USE_ROCM) && ROCM_VERSION < 50200 + TORCH_CHECK( + false, + "Calling addmm with sparse GPU tensors requires compiling ", + "PyTorch with ROCm 5.2+. ", + "Please use PyTorch built with newer ROCm version."); #else // older versions of cusparse on Windows segfault for complex128 dtype #if defined(_WIN32) && defined(CUSPARSE_VERSION) && CUSPARSE_VERSION < 11400 @@ -862,7 +868,7 @@ void addmv_out_sparse_csr( if (mat.layout() == kSparseBsr) { return block_sparse_mv(mat, vec, beta, alpha, result); } -#if !AT_USE_CUSPARSE_GENERIC_API() +#if !(AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API()) TORCH_CHECK( false, "Calling addmv on a sparse GPU tensor requires compiling ", @@ -936,7 +942,7 @@ void addmv_out_sparse_csr( if (!result.is_same(*result_)) { result.copy_(*result_); } -#endif +#endif // !(AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API()) } /* diff --git a/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cpp b/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cpp index 633a503ac8332..bd89e6fc1701a 100644 --- a/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cpp +++ b/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cpp @@ -19,7 +19,13 @@ #define IS_SPMM_AVAILABLE() 0 #endif -#if IS_SPMM_AVAILABLE() +#if defined(USE_ROCM) && ROCM_VERSION >= 50200 +#define IS_SPMM_HIP_AVAILABLE() 1 +#else +#define IS_SPMM_HIP_AVAILABLE() 0 +#endif + +#if IS_SPMM_AVAILABLE() || IS_SPMM_HIP_AVAILABLE() #include #endif @@ -86,7 +92,7 @@ cusparseOperation_t convertTransToCusparseOperation(char trans) { } } -#if IS_SPMM_AVAILABLE() +#if IS_SPMM_AVAILABLE() || IS_SPMM_HIP_AVAILABLE() namespace { template diff --git a/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu b/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu index bea7788e9d579..0cd9882b0c1be 100644 --- a/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu +++ b/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu @@ -471,6 +471,14 @@ SparseTensor& mul_out_sparse_cuda(const Tensor& t_, const Tensor& src_, SparseTe return _mul_dense_sparse_out(t_, src_, r_); } + // case mul(sparse, sparse) with a 0-dim input. 
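// Editor's note (not part of the patch): the two early returns added below mirror
// the CPU-side change earlier in this diff. A 0-dim sparse operand (a "wrapped
// scalar") is routed through _mul_sparse_sparse_zero_dim_out, which extracts its
// values and falls back to the existing mul(dense, sparse) kernel, instead of
// falling through to the size-equality check that would otherwise reject it.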
+ if (!src_.dim()) { + return _mul_sparse_sparse_zero_dim_out(src_, t_, r_); + } + if (!t_.dim()) { + return _mul_sparse_sparse_zero_dim_out(t_, src_, r_); + } + TORCH_CHECK(t_.is_cuda(), "mul: expected 'self' to be CUDA, but got CPU"); TORCH_CHECK(src_.is_cuda(), "mul: expected 'other' to be CUDA, but got CPU"); TORCH_CHECK(cuda::check_device({r_, t_, src_})); @@ -708,7 +716,7 @@ Tensor bmm_sparse_cuda(const SparseTensor& self, const Tensor& mat2) { return bmm_out_sparse_cuda(self, mat2, result); } -#if !(defined(USE_ROCM) || (defined(_MSC_VER) && CUSPARSE_VERSION < 11000)) +#if defined(USE_ROCM) || !(defined(_MSC_VER) && CUSPARSE_VERSION < 11000) __global__ void search_end_matrix_indices_cuda_kernel( int64_t* mat_el_end_indices, int64_t num_matrices, @@ -789,11 +797,9 @@ cudaDataType getTensorCudaDataType(Tensor self) { #endif Tensor& bmm_out_sparse_cuda(const SparseTensor& self, const Tensor& mat2, Tensor& result) { -#if defined(USE_ROCM) - TORCH_CHECK(false, "bmm sparse-dense is not supported on HIP"); -#elif defined(_MSC_VER) && (CUSPARSE_VERSION < 11000) +#if defined(_MSC_VER) && (CUSPARSE_VERSION < 11000) TORCH_CHECK(false, "bmm sparse-dense CUDA is not supported on Windows with cuda before 11.0"); -#elif defined(CUDART_VERSION) && (CUDART_VERSION >= 10010) // linux cuda >= 10.1 or windows cuda >= 11.0 +#elif defined(USE_ROCM) || (defined(CUDART_VERSION) && (CUDART_VERSION >= 10010)) // linux cuda >= 10.1 or windows cuda >= 11.0 TORCH_CHECK(!mat2.is_sparse(), "bmm_sparse: Tensor 'mat2' must be dense"); TORCH_CHECK(self.dense_dim() == 0, "bmm_sparse: Tensor 'self' must have 0 dense dims, but has ", self.dense_dim()); diff --git a/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu b/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu index dbc194ddb20b6..8cc5fc3157c38 100644 --- a/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu +++ b/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu @@ -734,13 +734,13 @@ void sparse_sparse_matmul_cuda_kernel( output_values.set_(csr_output.csr_values_); output_indices.resize_({2, nnz}); - auto output_indices_accessor = output_indices.packed_accessor(); + auto output_indices_accessor = output_indices.packed_accessor64(); auto csr_output_pointers_accessor = - csr_output.csr_pointers_.packed_accessor(); + csr_output.csr_pointers_.packed_accessor64(); auto csr_output_ind_accessor = - csr_output.csr_indices_.packed_accessor(); + csr_output.csr_indices_.packed_accessor64(); auto major_dim = result.size(0); cudaStream_t stream = at::cuda::getCurrentCUDAStream(); diff --git a/aten/src/ATen/native/tags.yaml b/aten/src/ATen/native/tags.yaml index 39ff5de6f7c48..8fc44c68c2674 100644 --- a/aten/src/ATen/native/tags.yaml +++ b/aten/src/ATen/native/tags.yaml @@ -12,6 +12,12 @@ desc: | This tag indicates if an operator's output's shape depends on input Tensor data. +- tag: data_dependent_output + desc: | + Operator has a non-Tensor output whose value is dependent on the data + of Tensor inputs. Among other things, this implies that this operator + cannot be run with meta tensor (since data is not available), nor + can it be symbolically traced. 
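(Editor's note: the `data_dependent_output` tag described above covers ops whose non-Tensor result is read from tensor data; `aten::_local_scalar_dense`, the kernel behind `Tensor::item()`, is a canonical example of such an op. A minimal C++ illustration, not part of the patch:)

```cpp
#include <ATen/ATen.h>

// item() returns a Scalar whose value is read from the tensor's storage, so the
// op cannot run on a meta tensor (which has no data) or be traced symbolically.
void item_example() {
  at::Tensor x = at::scalar_tensor(3.5);
  double v = x.item<double>();  // v == 3.5; requires real tensor data
  (void)v;                      // silence unused-variable warnings
}
```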
- tag: generated desc: | This tag indicates that the operator doesn't have an explicit entry in diff --git a/aten/src/ATen/native/transformers/attention.cpp b/aten/src/ATen/native/transformers/attention.cpp index 67fa95c72aa2e..6a6a6daafd866 100644 --- a/aten/src/ATen/native/transformers/attention.cpp +++ b/aten/src/ATen/native/transformers/attention.cpp @@ -1,5 +1,4 @@ #include - #include #include #include @@ -118,14 +117,10 @@ Tensor bmm_nt(const Tensor& a, const Tensor& b) { Tensor masked_softmax( Tensor& attn_scores, c10::optional attn_mask, - const Tensor& query) { + const Tensor& query, + c10::optional mask_type = NULL) { if (query.is_nested() && !attn_mask) { - if (attn_scores.is_cpu()) { - NestedTensor_softmax_dropout(query, attn_scores); - return attn_scores; - } - attn_mask = NestedTensor_to_mask(query, 2, attn_scores.size(2)); - attn_mask = attn_mask->to(query.device(), /*non-blocking=*/true); + return at::_nested_tensor_softmax_with_shape(attn_scores, query); } if (attn_mask && attn_mask->dtype() != at::kBool) { TORCH_WARN( @@ -143,7 +138,7 @@ Tensor masked_softmax( attn_mask = at::expand_inplace(attn_scores, *attn_mask)->contiguous(); } if (attn_mask) { - return _masked_softmax(attn_scores, *attn_mask); + return _masked_softmax(attn_scores, *attn_mask, attn_scores.dim() - 1, mask_type); } else { return _softmax_out(attn_scores, attn_scores, attn_scores.dim() - 1, false); } @@ -329,7 +324,8 @@ std::tuple native_multi_head_attention( const Tensor& proj_bias, const c10::optional& mask, bool need_weights, - bool average_attn_weights) { + bool average_attn_weights, + const c10::optional mask_type) { // query shape: [B, T, D] // qkv_weight shape: [3 * D, D] @@ -445,7 +441,7 @@ std::tuple native_multi_head_attention( // shape: [B, num_head, T, T] // TODO: long-term, have a kernel that works with // NestedTensor directly if there is no mask passed - qkt = masked_softmax(qkt, mask, query); + qkt = masked_softmax(qkt, mask, query, mask_type); #ifdef DEBUG_PRINT_EACH_STEP std::cerr << "qkt after softmax: " << qkt << std::endl; #endif @@ -727,5 +723,103 @@ std::tuple _scaled_dot_product_attention( return (need_attn_weights ? 
std::make_tuple(output, attn) : std::make_tuple(output, Tensor())); } +Tensor triton_multi_head_attention( + const Tensor& query, + const Tensor& key, + const Tensor& value, + const int64_t embed_dim, + const int64_t num_head, + const Tensor& qkv_weight, + const Tensor& qkv_bias, + const Tensor& proj_weight, + const Tensor& proj_bias, + const c10::optional& mask) { + // query shape: [B, T, D] + // qkv_weight shape: [3 * D, D] + TORCH_CHECK(!mask, "Only casual mask is supported for Triton."); + + const auto D = embed_dim; + TORCH_CHECK( + query.dim() == 3, + "expected 3-D `query`, got ", + query.dim(), + "-D tensor"); + TORCH_CHECK( + query.sizes()[2] == embed_dim, + "passed-in embed_dim ", + embed_dim, + " didn't match last dim of query ", + query.sizes()[2]); + TORCH_CHECK( + key.dim() == 3, + "expected 3-D `key`, got ", + key.dim(), + "-D tensor"); + TORCH_CHECK( + value.dim() == 3, + "expected 3-D `value`, got ", + value.dim(), + "-D tensor"); + TORCH_CHECK( + query.sizes() == key.sizes() && key.sizes() == value.sizes(), + "expected `query`/`key`/`value` shapes to match"); + TORCH_CHECK( + qkv_weight.dim() == 2, + "expected 2-D `qkv_weight`, got ", + qkv_weight.dim(), + "-D tensor"); + TORCH_CHECK( + D * 3 == qkv_weight.sizes()[0], + "expected `qkv_weight` first dim to be 3x embed_dim"); + TORCH_CHECK( + D == qkv_weight.sizes()[1], + "expected `qkv_weight` second dim to be embed_Dim"); + +#ifndef NDEBUG + const auto B = query.is_nested() + ? get_nested_tensor_impl(query)->get_nested_size_tensor().size(0) + : query.sizes()[0]; + auto T = query.is_nested() ? 0 : query.sizes()[1]; + const auto dim_per_head = D / num_head; +#endif + + // shape: [B, T, 3 x D] + auto qkv = qkv_projection(query, key, value, embed_dim, qkv_weight); + + // shape: 3 x [B, num_head, T, dim_per_head] + auto q_k_v = _transform_bias_rescale_qkv(qkv, qkv_bias, num_head); + qkv = Tensor(); // Not used any more, allow free + auto& q = std::get<0>(q_k_v); + const auto& k = std::get<1>(q_k_v); + const auto& v = std::get<2>(q_k_v); +#ifndef NDEBUG + debug_assert_shape(__LINE__, q, {B, num_head, T, dim_per_head}); + debug_assert_shape(__LINE__, k, {B, num_head, T, dim_per_head}); + debug_assert_shape(__LINE__, v, {B, num_head, T, dim_per_head}); +#endif +#ifdef DEBUG_PRINT_EACH_STEP + std::cerr << "q: " << q << std::endl; + std::cerr << "k: " << k << std::endl; + std::cerr << "v: " << v << std::endl; +#endif + + auto attn_ctx = at::_triton_scaled_dot_attention(q, k, v); + +#ifndef NDEBUG + debug_assert_shape(__LINE__, attn_ctx, {B, num_head, T, dim_per_head}); +#endif +#ifdef DEBUG_PRINT_EACH_STEP + std::cerr << "attn_ctx: " << attn_ctx << std::endl; +#endif + + // shape: [B, T, D] + // Fuse transform_0213 inside + auto proj = transform0213_gemm_nt_bias( + attn_ctx, proj_weight, proj_bias, query); +#ifndef NDEBUG + debug_assert_shape(__LINE__, proj, {B, T, D}); +#endif + return proj; +} } // namespace native } // namespace at diff --git a/aten/src/ATen/native/transformers/cuda/attention.cu b/aten/src/ATen/native/transformers/cuda/attention.cu index f347cd6c8c30a..dd31a755bf1dd 100644 --- a/aten/src/ATen/native/transformers/cuda/attention.cu +++ b/aten/src/ATen/native/transformers/cuda/attention.cu @@ -345,7 +345,7 @@ __host__ std::tuple transform_bias_rescale_qkv_cuda( accscalar_t, \ assume_aligned> \ <<>>( \ - nt_qkv->get_buffer() \ + nt_qkv_buffer \ .packed_accessor64(), \ qkv_bias.packed_accessor64(), \ offsets_ptr, \ @@ -376,6 +376,7 @@ __host__ std::tuple transform_bias_rescale_qkv_cuda( } if (qkv.is_nested()) { auto* 
nt_qkv = get_nested_tensor_impl(qkv); + const at::Tensor& nt_qkv_buffer = nt_qkv->get_buffer(); auto sizes = collapse_dims_1_and_2(nt_qkv->get_nested_size_tensor()); auto offsets = NestedTensor_batch_offsets_from_size_tensor(sizes, sizes.numel()); @@ -387,7 +388,7 @@ __host__ std::tuple transform_bias_rescale_qkv_cuda( const auto input_dim = sizes.sizes()[1]; TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input_dim == 1); if (aligned && - ((reinterpret_cast(nt_qkv->get_buffer().data_ptr()) % + ((reinterpret_cast(qkv.data_ptr()) % TRANSFORM_BIAS_RESCALE_VEC) == 0)) { CALL_ADD_PADDING_KERNEL(true); } else { @@ -406,5 +407,10 @@ __host__ std::tuple transform_bias_rescale_qkv_cuda( at::native::split(q_k_v.view({3 * B, num_head, T, dim_per_head}), B, 0); return std::make_tuple(q_k_v_s[0], q_k_v_s[1], q_k_v_s[2]); } + +Tensor triton_scaled_dot_attention(const Tensor& q, const Tensor& k, const Tensor& v, double dropout_p){ + TORCH_CHECK(false, "This operator should be overridden in python before use"); + return at::Tensor(); +} } // namespace native } // namespace at diff --git a/aten/src/ATen/native/transformers/transformer.cpp b/aten/src/ATen/native/transformers/transformer.cpp index bba3adc9b2c4b..2a641a40dfb5f 100644 --- a/aten/src/ATen/native/transformers/transformer.cpp +++ b/aten/src/ATen/native/transformers/transformer.cpp @@ -92,7 +92,8 @@ Tensor transformer_encoder_layer_forward( const Tensor& ffn_bias_1, const Tensor& ffn_weight_2, const Tensor& ffn_bias_2, - const c10::optional& mask) { + const c10::optional& mask, + const c10::optional mask_type) { { const Tensor& check_for_empty = src.is_nested() ? get_nested_tensor_impl(src)->get_buffer() : src; if (check_for_empty.numel() == 0) { @@ -117,7 +118,9 @@ Tensor transformer_encoder_layer_forward( proj_weight, proj_bias, mask, - false /* need_weights */)); + false /* need_weights */, + true /* average_attn_weights */, + mask_type)); add_in_place(x, src, use_nested_tensor); if (!norm_first) { x = norm(x, embed_dim, layer_norm_eps, layer_norm_weight_1, layer_norm_bias_1, use_nested_tensor); diff --git a/aten/src/ATen/native/ts_native_functions.yaml b/aten/src/ATen/native/ts_native_functions.yaml index 2ef238c0bff00..a6d26b3ad75b6 100644 --- a/aten/src/ATen/native/ts_native_functions.yaml +++ b/aten/src/ATen/native/ts_native_functions.yaml @@ -199,7 +199,6 @@ supported: - pixel_unshuffle - select_backward - _trilinear - - linalg_inv_ex - linalg_pinv.atol_rtol_tensor - logsumexp.out autograd: diff --git a/aten/src/ATen/native/vulkan/api/Allocator.h b/aten/src/ATen/native/vulkan/api/Allocator.h index 470eb07543c24..ca7541784cf06 100644 --- a/aten/src/ATen/native/vulkan/api/Allocator.h +++ b/aten/src/ATen/native/vulkan/api/Allocator.h @@ -47,7 +47,7 @@ #pragma clang diagnostic ignored "-Wunused-variable" #endif /* __clang__ */ -#include +#include #ifdef __clang__ #pragma clang diagnostic pop diff --git a/aten/src/ATen/native/vulkan/api/Command.cpp b/aten/src/ATen/native/vulkan/api/Command.cpp index b2c63ee4399f5..c42eda1c5ef26 100644 --- a/aten/src/ATen/native/vulkan/api/Command.cpp +++ b/aten/src/ATen/native/vulkan/api/Command.cpp @@ -215,6 +215,82 @@ void CommandBuffer::copy_texture_to_texture( state_ = CommandBuffer::State::RECORDING; } +void CommandBuffer::copy_texture_to_buffer( + const api::VulkanImage& source, + const api::VulkanBuffer& destination, + const api::utils::uvec3& copy_range, + const api::utils::uvec3& src_offset, + const api::utils::uvec3& dst_offset) { + TORCH_CHECK( + state_ == CommandBuffer::State::BARRIERS_INSERTED, + "Vulkan 
CommandBuffer: called copy_texture_to_buffer() on a command buffer whose state " + "is not BARRIERS_INSERTED."); + + const VkImageSubresourceLayers src_subresource_layers{ + VK_IMAGE_ASPECT_COLOR_BIT, // aspectMask + 0u, // mipLevel + 0u, // baseArrayLayer + 1u, // layerCount + }; + + const VkBufferImageCopy copy_details{ + dst_offset.data[0u], // bufferOffset + dst_offset.data[1u], // bufferRowLength + dst_offset.data[2u], // bufferImageHeight + src_subresource_layers, // imageSubresource + create_offset3d(src_offset), // imageOffset + create_extent3d(copy_range), // imageExtent + }; + + vkCmdCopyImageToBuffer( + handle_, + source.handle(), + source.layout(), + destination.handle(), + 1u, + ©_details); + + state_ = CommandBuffer::State::RECORDING; +} + +void CommandBuffer::copy_buffer_to_texture( + const api::VulkanBuffer& source, + const api::VulkanImage& destination, + const api::utils::uvec3& copy_range, + const api::utils::uvec3& src_offset, + const api::utils::uvec3& dst_offset) { + TORCH_CHECK( + state_ == CommandBuffer::State::BARRIERS_INSERTED, + "Vulkan CommandBuffer: called copy_buffer_to_texture() on a command buffer whose state " + "is not BARRIERS_INSERTED."); + + const VkImageSubresourceLayers dst_subresource_layers{ + VK_IMAGE_ASPECT_COLOR_BIT, // aspectMask + 0u, // mipLevel + 0u, // baseArrayLayer + 1u, // layerCount + }; + + const VkBufferImageCopy copy_details{ + src_offset.data[0u], // bufferOffset + src_offset.data[1u], // bufferRowLength + src_offset.data[2u], // bufferImageHeight + dst_subresource_layers, // imageSubresource + create_offset3d(dst_offset), // imageOffset + create_extent3d(copy_range), // imageExtent + }; + + vkCmdCopyBufferToImage( + handle_, + source.handle(), + destination.handle(), + destination.layout(), + 1u, + ©_details); + + state_ = CommandBuffer::State::RECORDING; +} + void CommandBuffer::write_timestamp( const VkQueryPool querypool, const uint32_t idx) const { diff --git a/aten/src/ATen/native/vulkan/api/Command.h b/aten/src/ATen/native/vulkan/api/Command.h index 6bb9f49e95656..f52e238463fd8 100644 --- a/aten/src/ATen/native/vulkan/api/Command.h +++ b/aten/src/ATen/native/vulkan/api/Command.h @@ -34,7 +34,7 @@ class CommandBuffer final { INVALID, // Used to indicate the command buffer is moved from NEW, // Set during constructor RECORDING, // Set during call to begin(), dispatch(), and - // copy_texture_to_texture() + // copy_*_to_*() PIPELINE_BOUND, // Set during call to bind_pipeline() DESCRIPTORS_BOUND, // Set during call to bind_descriptors() BARRIERS_INSERTED, // Set during call to insert_barrier() @@ -88,6 +88,20 @@ class CommandBuffer final { const api::utils::uvec3&, const api::utils::uvec3&); + void copy_texture_to_buffer( + const api::VulkanImage&, + const api::VulkanBuffer&, + const api::utils::uvec3&, + const api::utils::uvec3&, + const api::utils::uvec3&); + + void copy_buffer_to_texture( + const api::VulkanBuffer&, + const api::VulkanImage&, + const api::utils::uvec3&, + const api::utils::uvec3&, + const api::utils::uvec3&); + void write_timestamp(const VkQueryPool, const uint32_t) const; void reset_querypool(const VkQueryPool, const uint32_t, const uint32_t) const; diff --git a/aten/src/ATen/native/vulkan/api/Common.h b/aten/src/ATen/native/vulkan/api/Common.h index 1fa268e63409b..d658181b4802d 100644 --- a/aten/src/ATen/native/vulkan/api/Common.h +++ b/aten/src/ATen/native/vulkan/api/Common.h @@ -20,23 +20,36 @@ } #endif /* USE_VULKAN_SHADERC_RUNTIME */ -#define VK_CHECK(function) \ - do { \ - const VkResult result = 
(function); \ - TORCH_CHECK( \ - VK_SUCCESS == result, \ - C10_STRINGIZE(__FILE__), \ - " [", \ - C10_STRINGIZE(__LINE__), \ - "] " \ - "VkResult:", \ - result); \ +/* + * Check that the return code of a Vulkan API call is VK_SUCCESS, throwing an + * error with the returned code if not. If STRIP_ERROR_MESSAGES is defined then + * only the return code will be preserved. + */ +#ifdef STRIP_ERROR_MESSAGES +#define VK_CHECK(function) \ + do { \ + const VkResult result = (function); \ + if (VK_SUCCESS != result) { \ + throw c10::Error( \ + {__func__, __FILE__, static_cast(__LINE__)}, \ + c10::str(result)); \ + } \ } while (false) - -#define VK_CHECK_RELAXED(function) \ - do { \ - const VkResult result = (function); \ - TORCH_CHECK(VK_SUCCESS <= result, "VkResult:", result); \ +#else +#define VK_CHECK(function) \ + do { \ + const VkResult result = (function); \ + if (VK_SUCCESS != result) { \ + throw c10::Error( \ + {__func__, __FILE__, static_cast(__LINE__)}, \ + c10::str( \ + C10_STRINGIZE(__FILE__), \ + "[", \ + C10_STRINGIZE(__LINE__), \ + "] Expected VK_SUCCESS, got VkResult of ", \ + result)); \ + } \ } while (false) +#endif /* STRIP_ERROR_MESSAGES */ #endif /* USE_VULKAN_API */ diff --git a/aten/src/ATen/native/vulkan/api/Context.cpp b/aten/src/ATen/native/vulkan/api/Context.cpp index 4d7b3aa0d9877..a26dc95000328 100644 --- a/aten/src/ATen/native/vulkan/api/Context.cpp +++ b/aten/src/ATen/native/vulkan/api/Context.cpp @@ -72,48 +72,6 @@ void Context::submit_compute_epilogue( command_buffer.dispatch(global_workgroup_size); } -void Context::submit_texture_copy( - const PipelineBarrier& pipeline_barrier, - const api::VulkanImage& source, - const api::VulkanImage& destination, - const api::utils::uvec3& copy_range, - const api::utils::uvec3& src_offset, - const api::utils::uvec3& dst_offset, - const VkFence fence_handle) { - // Serialize recording to the shared command buffer. Do not initialize with a - // mutex just yet, since in some cases it will be externally managed. - std::unique_lock cmd_lock; - // Refer to comments in submit_compute_job for explanation. - if (fence_handle == VK_NULL_HANDLE) { - cmd_lock = std::unique_lock(cmd_mutex_); - } - - set_cmd(); - -#ifdef USE_VULKAN_GPU_DIAGNOSTICS - uint32_t log_idx = querypool_.shader_profile_begin( - cmd_, - "copy_texture_to_texture", - create_extent3d({0, 0, 0}), - create_extent3d({0, 0, 0})); -#endif /* USE_VULKAN_GPU_DIAGNOSTICS */ - - cmd_.insert_barrier(pipeline_barrier); - - cmd_.copy_texture_to_texture( - source, destination, copy_range, src_offset, dst_offset); - -#ifdef USE_VULKAN_GPU_DIAGNOSTICS - querypool_.shader_profile_end(cmd_, log_idx); -#endif /* USE_VULKAN_GPU_DIAGNOSTICS */ - - submit_count_++; - if (fence_handle != VK_NULL_HANDLE || - submit_count_ >= config_.cmdSubmitFrequency) { - submit_cmd_to_gpu(fence_handle); - } -} - void Context::submit_cmd_to_gpu(const VkFence fence_handle) { if (cmd_) { cmd_.end(); @@ -171,18 +129,26 @@ Context* context() { }; return new Context(runtime()->default_adapter_i(), config); + } catch (const c10::Error& e) { + TORCH_WARN( + "Pytorch Vulkan Context: Failed to initialize global vulkan context: ", + e.what()); } catch (const std::exception& e) { - TORCH_CHECK( - false, "Vulkan: Failed to initialize context! Error: ", e.what()); + TORCH_WARN( + "Pytorch Vulkan Context: Failed to initialize global vulkan context: ", + e.what()); } catch (...) { - TORCH_CHECK( - false, "Vulkan: Failed to initialize context! 
Error: Unknown"); + TORCH_WARN( + "Pytorch Vulkan Context: Failed to initialize global vulkan context!"); } return nullptr; }()); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(context, "Invalid Vulkan context!"); + TORCH_CHECK( + context, + "Pytorch Vulkan Context: The global context could not be retrieved " + "because it failed to initialize."); return context.get(); } diff --git a/aten/src/ATen/native/vulkan/api/Context.h b/aten/src/ATen/native/vulkan/api/Context.h index fbf4aae11376f..e9464b9a16a7e 100644 --- a/aten/src/ATen/native/vulkan/api/Context.h +++ b/aten/src/ATen/native/vulkan/api/Context.h @@ -163,6 +163,16 @@ class Context final { const utils::uvec3&); public: + template + void submit_copy( + const PipelineBarrier&, + const S&, + const D&, + const api::utils::uvec3&, + const api::utils::uvec3&, + const api::utils::uvec3&, + const VkFence fence_handle); + template void submit_compute_job( const ShaderSource&, @@ -172,15 +182,6 @@ class Context final { const VkFence fence_handle, Arguments&&...); - void submit_texture_copy( - const PipelineBarrier& pipeline_barrier, - const api::VulkanImage&, - const api::VulkanImage&, - const api::utils::uvec3&, - const api::utils::uvec3&, - const api::utils::uvec3&, - const VkFence fence_handle); - private: void submit_cmd_to_gpu(const VkFence fence_handle = VK_NULL_HANDLE); @@ -215,28 +216,33 @@ class UniformParamsBuffer final { } }; -class StagingBuffer final { +class StorageBuffer final { private: Context* context_p_; + c10::ScalarType dtype_; + size_t numel_; VulkanBuffer vulkan_buffer_; public: - StagingBuffer( + StorageBuffer( Context* context_p, - const VkDeviceSize size, + const c10::ScalarType dtype, + const size_t numel, const bool gpuonly = false) : context_p_(context_p), + dtype_(dtype), + numel_(numel), vulkan_buffer_(context_p_->adapter_ptr()->vma().create_storage_buffer( - size, + c10::elementSize(dtype_) * numel_, gpuonly)) {} - StagingBuffer(const StagingBuffer&) = delete; - StagingBuffer& operator=(const StagingBuffer&) = delete; + StorageBuffer(const StorageBuffer&) = delete; + StorageBuffer& operator=(const StorageBuffer&) = delete; - StagingBuffer(StagingBuffer&&) = delete; - StagingBuffer& operator=(StagingBuffer&&) = delete; + StorageBuffer(StorageBuffer&&) = delete; + StorageBuffer& operator=(StorageBuffer&&) = delete; - ~StagingBuffer() { + ~StorageBuffer() { context_p_->register_buffer_cleanup(vulkan_buffer_); } @@ -266,6 +272,91 @@ inline void bind( } // namespace detail +template +inline void record_copy( + CommandBuffer& cmd, + const S& source, + const D& destination, + const api::utils::uvec3& copy_range, + const api::utils::uvec3& src_offset, + const api::utils::uvec3& dst_offset) = delete; + +template <> +inline void record_copy( + CommandBuffer& cmd, + const VulkanImage& source, + const VulkanImage& destination, + const api::utils::uvec3& copy_range, + const api::utils::uvec3& src_offset, + const api::utils::uvec3& dst_offset) { + cmd.copy_texture_to_texture( + source, destination, copy_range, src_offset, dst_offset); +} + +template <> +inline void record_copy( + CommandBuffer& cmd, + const VulkanImage& source, + const VulkanBuffer& destination, + const api::utils::uvec3& copy_range, + const api::utils::uvec3& src_offset, + const api::utils::uvec3& dst_offset) { + cmd.copy_texture_to_buffer( + source, destination, copy_range, src_offset, dst_offset); +} + +template <> +inline void record_copy( + CommandBuffer& cmd, + const VulkanBuffer& source, + const VulkanImage& destination, + const api::utils::uvec3& copy_range, + 
const api::utils::uvec3& src_offset, + const api::utils::uvec3& dst_offset) { + cmd.copy_buffer_to_texture( + source, destination, copy_range, src_offset, dst_offset); +} + +template +inline void Context::submit_copy( + const PipelineBarrier& pipeline_barrier, + const S& source, + const D& destination, + const api::utils::uvec3& copy_range, + const api::utils::uvec3& src_offset, + const api::utils::uvec3& dst_offset, + const VkFence fence_handle) { + // Serialize recording to the shared command buffer. Do not initialize with a + // mutex just yet, since in some cases it will be externally managed. + std::unique_lock cmd_lock; + // Refer to comments in submit_compute_job for explanation. + if (fence_handle == VK_NULL_HANDLE) { + cmd_lock = std::unique_lock(cmd_mutex_); + } + + set_cmd(); + +#ifdef USE_VULKAN_GPU_DIAGNOSTICS + std::string label = "cmd_copy"; + uint32_t log_idx = querypool_.shader_profile_begin( + cmd_, label, create_extent3d({0, 0, 0}), create_extent3d({0, 0, 0})); +#endif /* USE_VULKAN_GPU_DIAGNOSTICS */ + + cmd_.insert_barrier(pipeline_barrier); + + record_copy(cmd_, source, destination, copy_range, src_offset, dst_offset); + +#ifdef USE_VULKAN_GPU_DIAGNOSTICS + querypool_.shader_profile_end(cmd_, log_idx); +#endif /* USE_VULKAN_GPU_DIAGNOSTICS */ + + submit_count_++; + if (fence_handle != VK_NULL_HANDLE || + submit_count_ >= config_.cmdSubmitFrequency) { + submit_cmd_to_gpu(fence_handle); + } +} + template inline void Context::submit_compute_job( const ShaderSource& shader_descriptor, diff --git a/aten/src/ATen/native/vulkan/api/Resource.cpp b/aten/src/ATen/native/vulkan/api/Resource.cpp index 82b98579e051f..fde96a87f959e 100644 --- a/aten/src/ATen/native/vulkan/api/Resource.cpp +++ b/aten/src/ATen/native/vulkan/api/Resource.cpp @@ -10,6 +10,20 @@ namespace api { // Utility Functions // +/* + * This function is used to determine what image format to use for a given + * dtype. + * + * TODO: enable proper format selection between kFloat and kHalf. + * + * Context: due to limitations of the shader compilation system, at the moment + * it is not possible to support both 32 bit and 16 bit float formats since + * shaders will have to specify the format qualifier of texture inputs. Right + * now, shaders are compiled with either rgba16f or rgba32f qualifiers depending + * on whether USE_VULKAN_FP16_INFERENCE is set. Therefore, textures must be + * always created with the corresponding VkFormat. Consequently, kHalf tensors + * are currently unsupported in favor of enforcing inputs to be of kFloat dtype. + */ VkFormat vk_format(const caffe2::TypeMeta dtype) { switch (c10::typeMetaToScalarType(dtype)) { case kFloat: @@ -18,15 +32,34 @@ VkFormat vk_format(const caffe2::TypeMeta dtype) { #else return VK_FORMAT_R32G32B32A32_SFLOAT; #endif /* USE_VULKAN_FP16_INFERENCE */ - case c10::kQUInt8: return VK_FORMAT_R8G8B8A8_UINT; default: - TORCH_CHECK(false, "Vulkan tensor format not supported!"); + TORCH_CHECK( + false, "Vulkan vk_format(): no corresponding format for dtype"); + } +} + +/* + * This function is used to map a texture format to a corresponding + * c10::ScalarType. It is primarily used to set the data type of a + * StorageBuffer object that will receive copied data from a texture. 
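For readers following the new copy path above, here is a minimal sketch (an illustration, not code from this patch) of how `Context::submit_copy<>()` and the new `c10_scalartype()` mapping could be combined to read a texture back into a host-visible staging buffer. The header path, the namespace, the copy extents, the element count, the default-constructed `PipelineBarrier`, and the `buffer()` accessor on `StorageBuffer` are assumptions made for the example:

```
// Sketch only: assumes the ATen Vulkan api headers are available and that
// StorageBuffer still exposes its wrapped VulkanBuffer via buffer(), as the
// former StagingBuffer did.
#include <ATen/native/vulkan/api/api.h>

namespace example {
using namespace at::native::vulkan;

void copy_texture_to_staging(api::VulkanImage& image, const size_t numel) {
  api::Context* const context = api::context();

  // Derive the staging buffer dtype from the texture format via the new
  // c10_scalartype() mapping.
  api::StorageBuffer staging(
      context, api::c10_scalartype(image.format()), numel, /*gpuonly=*/false);

  api::PipelineBarrier pipeline_barrier{}; // assumed default-constructible
  const api::utils::uvec3 copy_range{4u, 4u, 1u}; // illustrative extents
  const api::utils::uvec3 offset{0u, 0u, 0u};

  // Dispatches to the copy_texture_to_buffer specialization of record_copy.
  context->submit_copy<api::VulkanImage, api::VulkanBuffer>(
      pipeline_barrier,
      image,
      staging.buffer(), // assumed accessor
      copy_range,
      offset,
      offset,
      VK_NULL_HANDLE);
}
} // namespace example
```
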
+ */ +c10::ScalarType c10_scalartype(const VkFormat image_format) { + switch (image_format) { + case VK_FORMAT_R32G32B32A32_SFLOAT: + return c10::kFloat; + case VK_FORMAT_R16G16B16A16_SFLOAT: + return c10::kHalf; + case VK_FORMAT_R8G8B8A8_UINT: + return c10::kQUInt8; + + default: + TORCH_CHECK(false, "vulkan c10_scalartype(): Unknown VkFormat."); } - return VK_FORMAT_UNDEFINED; } + // // MemoryBarrier // @@ -137,7 +170,8 @@ MemoryMap::MemoryMap(const VulkanBuffer& buffer, const uint8_t access) : access_(access), allocator_(buffer.vma_allocator()), allocation_(buffer.allocation()), - data_(nullptr) { + data_(nullptr), + data_len_{buffer.mem_size()} { VK_CHECK(vmaMapMemory(allocator_, allocation_, &data_)); } @@ -145,7 +179,8 @@ MemoryMap::MemoryMap(MemoryMap&& other) noexcept : access_(other.access_), allocator_(other.allocator_), allocation_(other.allocation_), - data_(other.data_) { + data_(other.data_), + data_len_{other.data_len_} { other.allocation_ = VK_NULL_HANDLE; other.data_ = nullptr; } @@ -158,8 +193,8 @@ MemoryMap::~MemoryMap() { if (access_ & MemoryAccessType::WRITE) { // Call will be ignored by implementation if the memory type this allocation // belongs to is not HOST_VISIBLE or is HOST_COHERENT, which is the behavior - // we want. - VK_CHECK(vmaFlushAllocation(allocator_, allocation_, 0u, VK_WHOLE_SIZE)); + // we want. Don't check the result here as the destructor cannot throw. + vmaFlushAllocation(allocator_, allocation_, 0u, VK_WHOLE_SIZE); } vmaUnmapMemory(allocator_, allocation_); @@ -480,6 +515,7 @@ VkSampler SamplerCache::retrieve(const SamplerCache::Key& key) { } void SamplerCache::purge() { + std::lock_guard lock(cache_mutex_); cache_.clear(); } diff --git a/aten/src/ATen/native/vulkan/api/Resource.h b/aten/src/ATen/native/vulkan/api/Resource.h index 1efd907b3246d..75df3aa88560c 100644 --- a/aten/src/ATen/native/vulkan/api/Resource.h +++ b/aten/src/ATen/native/vulkan/api/Resource.h @@ -3,6 +3,9 @@ #ifdef USE_VULKAN_API #include +#include + +#include #include #include @@ -16,6 +19,8 @@ typedef uint8_t MemoryAccessFlags; VkFormat vk_format(const caffe2::TypeMeta dtype); +c10::ScalarType c10_scalartype(const VkFormat image_format); + constexpr VmaAllocationCreateFlags DEFAULT_ALLOCATION_STRATEGY = VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT; @@ -104,6 +109,10 @@ class VulkanBuffer final { return buffer_properties_.mem_range; } + inline VkDeviceSize mem_size() const { + return buffer_properties_.size; + } + operator bool() const { return (allocation_ != VK_NULL_HANDLE); } @@ -128,6 +137,7 @@ class MemoryMap final { VmaAllocator allocator_; VmaAllocation allocation_; void* data_; + VkDeviceSize data_len_; public: template @@ -135,6 +145,10 @@ class MemoryMap final { return reinterpret_cast(data_); } + inline size_t nbytes() { + return utils::safe_downcast(data_len_); + } + void invalidate(); }; @@ -267,6 +281,10 @@ class VulkanImage final { return allocation_; } + inline VkFormat format() const { + return image_properties_.image_format; + } + inline VkExtent3D extents() const { return image_properties_.image_extents; } diff --git a/aten/src/ATen/native/vulkan/api/Runtime.cpp b/aten/src/ATen/native/vulkan/api/Runtime.cpp index a1c460fa4dc97..95cd716dcee40 100644 --- a/aten/src/ATen/native/vulkan/api/Runtime.cpp +++ b/aten/src/ATen/native/vulkan/api/Runtime.cpp @@ -253,6 +253,11 @@ std::unique_ptr init_global_vulkan_runtime() { try { return std::make_unique(Runtime(default_config)); + } catch (const c10::Error& e) { + TORCH_WARN( + "Pytorch Vulkan Runtime: Failed to 
initialize the global vulkan runtime! " + "The global vulkan runtime is invalid. Error: ", + e.what()); } catch (const std::exception& e) { TORCH_WARN( "Pytorch Vulkan Runtime: Failed to initialize the global vulkan runtime! " @@ -286,6 +291,10 @@ Runtime::Runtime(const RuntimeConfiguration config) case AdapterSelector::First: default_adapter_i_ = create_adapter(select_first); } + } catch (const c10::Error& e) { + TORCH_WARN( + "Pytorch Vulkan Runtime: Could not initialize default device! Error: ", + e.what()); } catch (const std::exception& e) { TORCH_WARN( "Pytorch Vulkan Runtime: Could not initialize default device! Error: ", @@ -372,10 +381,12 @@ Runtime* runtime() { // Runtime.h as it would have internal linkage. static const std::unique_ptr p_runtime = init_global_vulkan_runtime(); + TORCH_CHECK( p_runtime, "Pytorch Vulkan Runtime: The global runtime could not be retrieved " "because it failed to initialize."); + return p_runtime.get(); } diff --git a/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h b/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h deleted file mode 100644 index 7b04e54d944bd..0000000000000 --- a/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h +++ /dev/null @@ -1,19558 +0,0 @@ -// -// Copyright (c) 2017-2022 Advanced Micro Devices, Inc. All rights reserved. -// -// Permission is hereby granted, free of charge, to any person obtaining a copy -// of this software and associated documentation files (the "Software"), to deal -// in the Software without restriction, including without limitation the rights -// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -// copies of the Software, and to permit persons to whom the Software is -// furnished to do so, subject to the following conditions: -// -// The above copyright notice and this permission notice shall be included in -// all copies or substantial portions of the Software. -// -// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -// THE SOFTWARE. -// - -#ifndef AMD_VULKAN_MEMORY_ALLOCATOR_H -#define AMD_VULKAN_MEMORY_ALLOCATOR_H - -/** \mainpage Vulkan Memory Allocator - -Version 3.0.1 (2022-05-26) - -Copyright (c) 2017-2022 Advanced Micro Devices, Inc. All rights reserved. 
\n -License: MIT - -API documentation divided into groups: [Modules](modules.html) - -\section main_table_of_contents Table of contents - -- User guide - - \subpage quick_start - - [Project setup](@ref quick_start_project_setup) - - [Initialization](@ref quick_start_initialization) - - [Resource allocation](@ref quick_start_resource_allocation) - - \subpage choosing_memory_type - - [Usage](@ref choosing_memory_type_usage) - - [Required and preferred flags](@ref choosing_memory_type_required_preferred_flags) - - [Explicit memory types](@ref choosing_memory_type_explicit_memory_types) - - [Custom memory pools](@ref choosing_memory_type_custom_memory_pools) - - [Dedicated allocations](@ref choosing_memory_type_dedicated_allocations) - - \subpage memory_mapping - - [Mapping functions](@ref memory_mapping_mapping_functions) - - [Persistently mapped memory](@ref memory_mapping_persistently_mapped_memory) - - [Cache flush and invalidate](@ref memory_mapping_cache_control) - - \subpage staying_within_budget - - [Querying for budget](@ref staying_within_budget_querying_for_budget) - - [Controlling memory usage](@ref staying_within_budget_controlling_memory_usage) - - \subpage resource_aliasing - - \subpage custom_memory_pools - - [Choosing memory type index](@ref custom_memory_pools_MemTypeIndex) - - [Linear allocation algorithm](@ref linear_algorithm) - - [Free-at-once](@ref linear_algorithm_free_at_once) - - [Stack](@ref linear_algorithm_stack) - - [Double stack](@ref linear_algorithm_double_stack) - - [Ring buffer](@ref linear_algorithm_ring_buffer) - - \subpage defragmentation - - \subpage statistics - - [Numeric statistics](@ref statistics_numeric_statistics) - - [JSON dump](@ref statistics_json_dump) - - \subpage allocation_annotation - - [Allocation user data](@ref allocation_user_data) - - [Allocation names](@ref allocation_names) - - \subpage virtual_allocator - - \subpage debugging_memory_usage - - [Memory initialization](@ref debugging_memory_usage_initialization) - - [Margins](@ref debugging_memory_usage_margins) - - [Corruption detection](@ref debugging_memory_usage_corruption_detection) - - \subpage opengl_interop -- \subpage usage_patterns - - [GPU-only resource](@ref usage_patterns_gpu_only) - - [Staging copy for upload](@ref usage_patterns_staging_copy_upload) - - [Readback](@ref usage_patterns_readback) - - [Advanced data uploading](@ref usage_patterns_advanced_data_uploading) - - [Other use cases](@ref usage_patterns_other_use_cases) -- \subpage configuration - - [Pointers to Vulkan functions](@ref config_Vulkan_functions) - - [Custom host memory allocator](@ref custom_memory_allocator) - - [Device memory allocation callbacks](@ref allocation_callbacks) - - [Device heap memory limit](@ref heap_memory_limit) -- Extension support - - \subpage vk_khr_dedicated_allocation - - \subpage enabling_buffer_device_address - - \subpage vk_ext_memory_priority - - \subpage vk_amd_device_coherent_memory -- \subpage general_considerations - - [Thread safety](@ref general_considerations_thread_safety) - - [Versioning and compatibility](@ref general_considerations_versioning_and_compatibility) - - [Validation layer warnings](@ref general_considerations_validation_layer_warnings) - - [Allocation algorithm](@ref general_considerations_allocation_algorithm) - - [Features not supported](@ref general_considerations_features_not_supported) - -\section main_see_also See also - -- [**Product page on GPUOpen**](https://gpuopen.com/gaming-product/vulkan-memory-allocator/) -- [**Source repository on 
GitHub**](https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator) - -\defgroup group_init Library initialization - -\brief API elements related to the initialization and management of the entire library, especially #VmaAllocator object. - -\defgroup group_alloc Memory allocation - -\brief API elements related to the allocation, deallocation, and management of Vulkan memory, buffers, images. -Most basic ones being: vmaCreateBuffer(), vmaCreateImage(). - -\defgroup group_virtual Virtual allocator - -\brief API elements related to the mechanism of \ref virtual_allocator - using the core allocation algorithm -for user-defined purpose without allocating any real GPU memory. - -\defgroup group_stats Statistics - -\brief API elements that query current status of the allocator, from memory usage, budget, to full dump of the internal state in JSON format. -See documentation chapter: \ref statistics. -*/ - - -#ifdef __cplusplus -extern "C" { -#endif - -#ifndef VULKAN_H_ - #include -#endif - -// Define this macro to declare maximum supported Vulkan version in format AAABBBCCC, -// where AAA = major, BBB = minor, CCC = patch. -// If you want to use version > 1.0, it still needs to be enabled via VmaAllocatorCreateInfo::vulkanApiVersion. -#if !defined(VMA_VULKAN_VERSION) - #if defined(VK_VERSION_1_3) - #define VMA_VULKAN_VERSION 1003000 - #elif defined(VK_VERSION_1_2) - #define VMA_VULKAN_VERSION 1002000 - #elif defined(VK_VERSION_1_1) - #define VMA_VULKAN_VERSION 1001000 - #else - #define VMA_VULKAN_VERSION 1000000 - #endif -#endif - -#if defined(__ANDROID__) && defined(VK_NO_PROTOTYPES) && VMA_STATIC_VULKAN_FUNCTIONS - extern PFN_vkGetInstanceProcAddr vkGetInstanceProcAddr; - extern PFN_vkGetDeviceProcAddr vkGetDeviceProcAddr; - extern PFN_vkGetPhysicalDeviceProperties vkGetPhysicalDeviceProperties; - extern PFN_vkGetPhysicalDeviceMemoryProperties vkGetPhysicalDeviceMemoryProperties; - extern PFN_vkAllocateMemory vkAllocateMemory; - extern PFN_vkFreeMemory vkFreeMemory; - extern PFN_vkMapMemory vkMapMemory; - extern PFN_vkUnmapMemory vkUnmapMemory; - extern PFN_vkFlushMappedMemoryRanges vkFlushMappedMemoryRanges; - extern PFN_vkInvalidateMappedMemoryRanges vkInvalidateMappedMemoryRanges; - extern PFN_vkBindBufferMemory vkBindBufferMemory; - extern PFN_vkBindImageMemory vkBindImageMemory; - extern PFN_vkGetBufferMemoryRequirements vkGetBufferMemoryRequirements; - extern PFN_vkGetImageMemoryRequirements vkGetImageMemoryRequirements; - extern PFN_vkCreateBuffer vkCreateBuffer; - extern PFN_vkDestroyBuffer vkDestroyBuffer; - extern PFN_vkCreateImage vkCreateImage; - extern PFN_vkDestroyImage vkDestroyImage; - extern PFN_vkCmdCopyBuffer vkCmdCopyBuffer; - #if VMA_VULKAN_VERSION >= 1001000 - extern PFN_vkGetBufferMemoryRequirements2 vkGetBufferMemoryRequirements2; - extern PFN_vkGetImageMemoryRequirements2 vkGetImageMemoryRequirements2; - extern PFN_vkBindBufferMemory2 vkBindBufferMemory2; - extern PFN_vkBindImageMemory2 vkBindImageMemory2; - extern PFN_vkGetPhysicalDeviceMemoryProperties2 vkGetPhysicalDeviceMemoryProperties2; - #endif // #if VMA_VULKAN_VERSION >= 1001000 -#endif // #if defined(__ANDROID__) && VMA_STATIC_VULKAN_FUNCTIONS && VK_NO_PROTOTYPES - -#if !defined(VMA_DEDICATED_ALLOCATION) - #if VK_KHR_get_memory_requirements2 && VK_KHR_dedicated_allocation - #define VMA_DEDICATED_ALLOCATION 1 - #else - #define VMA_DEDICATED_ALLOCATION 0 - #endif -#endif - -#if !defined(VMA_BIND_MEMORY2) - #if VK_KHR_bind_memory2 - #define VMA_BIND_MEMORY2 1 - #else - #define VMA_BIND_MEMORY2 0 - #endif 
-#endif - -#if !defined(VMA_MEMORY_BUDGET) - #if VK_EXT_memory_budget && (VK_KHR_get_physical_device_properties2 || VMA_VULKAN_VERSION >= 1001000) - #define VMA_MEMORY_BUDGET 1 - #else - #define VMA_MEMORY_BUDGET 0 - #endif -#endif - -// Defined to 1 when VK_KHR_buffer_device_address device extension or equivalent core Vulkan 1.2 feature is defined in its headers. -#if !defined(VMA_BUFFER_DEVICE_ADDRESS) - #if VK_KHR_buffer_device_address || VMA_VULKAN_VERSION >= 1002000 - #define VMA_BUFFER_DEVICE_ADDRESS 1 - #else - #define VMA_BUFFER_DEVICE_ADDRESS 0 - #endif -#endif - -// Defined to 1 when VK_EXT_memory_priority device extension is defined in Vulkan headers. -#if !defined(VMA_MEMORY_PRIORITY) - #if VK_EXT_memory_priority - #define VMA_MEMORY_PRIORITY 1 - #else - #define VMA_MEMORY_PRIORITY 0 - #endif -#endif - -// Defined to 1 when VK_KHR_external_memory device extension is defined in Vulkan headers. -#if !defined(VMA_EXTERNAL_MEMORY) - #if VK_KHR_external_memory - #define VMA_EXTERNAL_MEMORY 1 - #else - #define VMA_EXTERNAL_MEMORY 0 - #endif -#endif - -// Define these macros to decorate all public functions with additional code, -// before and after returned type, appropriately. This may be useful for -// exporting the functions when compiling VMA as a separate library. Example: -// #define VMA_CALL_PRE __declspec(dllexport) -// #define VMA_CALL_POST __cdecl -#ifndef VMA_CALL_PRE - #define VMA_CALL_PRE -#endif -#ifndef VMA_CALL_POST - #define VMA_CALL_POST -#endif - -// Define this macro to decorate pointers with an attribute specifying the -// length of the array they point to if they are not null. -// -// The length may be one of -// - The name of another parameter in the argument list where the pointer is declared -// - The name of another member in the struct where the pointer is declared -// - The name of a member of a struct type, meaning the value of that member in -// the context of the call. For example -// VMA_LEN_IF_NOT_NULL("VkPhysicalDeviceMemoryProperties::memoryHeapCount"), -// this means the number of memory heaps available in the device associated -// with the VmaAllocator being dealt with. -#ifndef VMA_LEN_IF_NOT_NULL - #define VMA_LEN_IF_NOT_NULL(len) -#endif - -// The VMA_NULLABLE macro is defined to be _Nullable when compiling with Clang. -// see: https://clang.llvm.org/docs/AttributeReference.html#nullable -#ifndef VMA_NULLABLE - #ifdef __clang__ - #define VMA_NULLABLE _Nullable - #else - #define VMA_NULLABLE - #endif -#endif - -// The VMA_NOT_NULL macro is defined to be _Nonnull when compiling with Clang. 
-// see: https://clang.llvm.org/docs/AttributeReference.html#nonnull -#ifndef VMA_NOT_NULL - #ifdef __clang__ - #define VMA_NOT_NULL _Nonnull - #else - #define VMA_NOT_NULL - #endif -#endif - -// If non-dispatchable handles are represented as pointers then we can give -// then nullability annotations -#ifndef VMA_NOT_NULL_NON_DISPATCHABLE - #if defined(__LP64__) || defined(_WIN64) || (defined(__x86_64__) && !defined(__ILP32__) ) || defined(_M_X64) || defined(__ia64) || defined (_M_IA64) || defined(__aarch64__) || defined(__powerpc64__) - #define VMA_NOT_NULL_NON_DISPATCHABLE VMA_NOT_NULL - #else - #define VMA_NOT_NULL_NON_DISPATCHABLE - #endif -#endif - -#ifndef VMA_NULLABLE_NON_DISPATCHABLE - #if defined(__LP64__) || defined(_WIN64) || (defined(__x86_64__) && !defined(__ILP32__) ) || defined(_M_X64) || defined(__ia64) || defined (_M_IA64) || defined(__aarch64__) || defined(__powerpc64__) - #define VMA_NULLABLE_NON_DISPATCHABLE VMA_NULLABLE - #else - #define VMA_NULLABLE_NON_DISPATCHABLE - #endif -#endif - -#ifndef VMA_STATS_STRING_ENABLED - #define VMA_STATS_STRING_ENABLED 1 -#endif - -//////////////////////////////////////////////////////////////////////////////// -//////////////////////////////////////////////////////////////////////////////// -// -// INTERFACE -// -//////////////////////////////////////////////////////////////////////////////// -//////////////////////////////////////////////////////////////////////////////// - -// Sections for managing code placement in file, only for development purposes e.g. for convenient folding inside an IDE. -#ifndef _VMA_ENUM_DECLARATIONS - -/** -\addtogroup group_init -@{ -*/ - -/// Flags for created #VmaAllocator. -typedef enum VmaAllocatorCreateFlagBits -{ - /** \brief Allocator and all objects created from it will not be synchronized internally, so you must guarantee they are used from only one thread at a time or synchronized externally by you. - - Using this flag may increase performance because internal mutexes are not used. - */ - VMA_ALLOCATOR_CREATE_EXTERNALLY_SYNCHRONIZED_BIT = 0x00000001, - /** \brief Enables usage of VK_KHR_dedicated_allocation extension. - - The flag works only if VmaAllocatorCreateInfo::vulkanApiVersion `== VK_API_VERSION_1_0`. - When it is `VK_API_VERSION_1_1`, the flag is ignored because the extension has been promoted to Vulkan 1.1. - - Using this extension will automatically allocate dedicated blocks of memory for - some buffers and images instead of suballocating place for them out of bigger - memory blocks (as if you explicitly used #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT - flag) when it is recommended by the driver. It may improve performance on some - GPUs. - - You may set this flag only if you found out that following device extensions are - supported, you enabled them while creating Vulkan device passed as - VmaAllocatorCreateInfo::device, and you want them to be used internally by this - library: - - - VK_KHR_get_memory_requirements2 (device extension) - - VK_KHR_dedicated_allocation (device extension) - - When this flag is set, you can experience following warnings reported by Vulkan - validation layer. You can ignore them. - - > vkBindBufferMemory(): Binding memory to buffer 0x2d but vkGetBufferMemoryRequirements() has not been called on that buffer. - */ - VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT = 0x00000002, - /** - Enables usage of VK_KHR_bind_memory2 extension. - - The flag works only if VmaAllocatorCreateInfo::vulkanApiVersion `== VK_API_VERSION_1_0`. 
- When it is `VK_API_VERSION_1_1`, the flag is ignored because the extension has been promoted to Vulkan 1.1. - - You may set this flag only if you found out that this device extension is supported, - you enabled it while creating Vulkan device passed as VmaAllocatorCreateInfo::device, - and you want it to be used internally by this library. - - The extension provides functions `vkBindBufferMemory2KHR` and `vkBindImageMemory2KHR`, - which allow to pass a chain of `pNext` structures while binding. - This flag is required if you use `pNext` parameter in vmaBindBufferMemory2() or vmaBindImageMemory2(). - */ - VMA_ALLOCATOR_CREATE_KHR_BIND_MEMORY2_BIT = 0x00000004, - /** - Enables usage of VK_EXT_memory_budget extension. - - You may set this flag only if you found out that this device extension is supported, - you enabled it while creating Vulkan device passed as VmaAllocatorCreateInfo::device, - and you want it to be used internally by this library, along with another instance extension - VK_KHR_get_physical_device_properties2, which is required by it (or Vulkan 1.1, where this extension is promoted). - - The extension provides query for current memory usage and budget, which will probably - be more accurate than an estimation used by the library otherwise. - */ - VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT = 0x00000008, - /** - Enables usage of VK_AMD_device_coherent_memory extension. - - You may set this flag only if you: - - - found out that this device extension is supported and enabled it while creating Vulkan device passed as VmaAllocatorCreateInfo::device, - - checked that `VkPhysicalDeviceCoherentMemoryFeaturesAMD::deviceCoherentMemory` is true and set it while creating the Vulkan device, - - want it to be used internally by this library. - - The extension and accompanying device feature provide access to memory types with - `VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD` and `VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD` flags. - They are useful mostly for writing breadcrumb markers - a common method for debugging GPU crash/hang/TDR. - - When the extension is not enabled, such memory types are still enumerated, but their usage is illegal. - To protect from this error, if you don't create the allocator with this flag, it will refuse to allocate any memory or create a custom pool in such memory type, - returning `VK_ERROR_FEATURE_NOT_PRESENT`. - */ - VMA_ALLOCATOR_CREATE_AMD_DEVICE_COHERENT_MEMORY_BIT = 0x00000010, - /** - Enables usage of "buffer device address" feature, which allows you to use function - `vkGetBufferDeviceAddress*` to get raw GPU pointer to a buffer and pass it for usage inside a shader. - - You may set this flag only if you: - - 1. (For Vulkan version < 1.2) Found as available and enabled device extension - VK_KHR_buffer_device_address. - This extension is promoted to core Vulkan 1.2. - 2. Found as available and enabled device feature `VkPhysicalDeviceBufferDeviceAddressFeatures::bufferDeviceAddress`. - - When this flag is set, you can create buffers with `VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT` using VMA. - The library automatically adds `VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT` to - allocated memory blocks wherever it might be needed. - - For more information, see documentation chapter \ref enabling_buffer_device_address. - */ - VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT = 0x00000020, - /** - Enables usage of VK_EXT_memory_priority extension in the library. 
- - You may set this flag only if you found available and enabled this device extension, - along with `VkPhysicalDeviceMemoryPriorityFeaturesEXT::memoryPriority == VK_TRUE`, - while creating Vulkan device passed as VmaAllocatorCreateInfo::device. - - When this flag is used, VmaAllocationCreateInfo::priority and VmaPoolCreateInfo::priority - are used to set priorities of allocated Vulkan memory. Without it, these variables are ignored. - - A priority must be a floating-point value between 0 and 1, indicating the priority of the allocation relative to other memory allocations. - Larger values are higher priority. The granularity of the priorities is implementation-dependent. - It is automatically passed to every call to `vkAllocateMemory` done by the library using structure `VkMemoryPriorityAllocateInfoEXT`. - The value to be used for default priority is 0.5. - For more details, see the documentation of the VK_EXT_memory_priority extension. - */ - VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT = 0x00000040, - - VMA_ALLOCATOR_CREATE_FLAG_BITS_MAX_ENUM = 0x7FFFFFFF -} VmaAllocatorCreateFlagBits; -/// See #VmaAllocatorCreateFlagBits. -typedef VkFlags VmaAllocatorCreateFlags; - -/** @} */ - -/** -\addtogroup group_alloc -@{ -*/ - -/// \brief Intended usage of the allocated memory. -typedef enum VmaMemoryUsage -{ - /** No intended memory usage specified. - Use other members of VmaAllocationCreateInfo to specify your requirements. - */ - VMA_MEMORY_USAGE_UNKNOWN = 0, - /** - \deprecated Obsolete, preserved for backward compatibility. - Prefers `VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT`. - */ - VMA_MEMORY_USAGE_GPU_ONLY = 1, - /** - \deprecated Obsolete, preserved for backward compatibility. - Guarantees `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT` and `VK_MEMORY_PROPERTY_HOST_COHERENT_BIT`. - */ - VMA_MEMORY_USAGE_CPU_ONLY = 2, - /** - \deprecated Obsolete, preserved for backward compatibility. - Guarantees `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT`, prefers `VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT`. - */ - VMA_MEMORY_USAGE_CPU_TO_GPU = 3, - /** - \deprecated Obsolete, preserved for backward compatibility. - Guarantees `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT`, prefers `VK_MEMORY_PROPERTY_HOST_CACHED_BIT`. - */ - VMA_MEMORY_USAGE_GPU_TO_CPU = 4, - /** - \deprecated Obsolete, preserved for backward compatibility. - Prefers not `VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT`. - */ - VMA_MEMORY_USAGE_CPU_COPY = 5, - /** - Lazily allocated GPU memory having `VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT`. - Exists mostly on mobile platforms. Using it on desktop PC or other GPUs with no such memory type present will fail the allocation. - - Usage: Memory for transient attachment images (color attachments, depth attachments etc.), created with `VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT`. - - Allocations with this usage are always created as dedicated - it implies #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT. - */ - VMA_MEMORY_USAGE_GPU_LAZILY_ALLOCATED = 6, - /** - Selects best memory type automatically. - This flag is recommended for most common use cases. - - When using this flag, if you want to map the allocation (using vmaMapMemory() or #VMA_ALLOCATION_CREATE_MAPPED_BIT), - you must pass one of the flags: #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or #VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT - in VmaAllocationCreateInfo::flags. - - It can be used only with functions that let the library know `VkBufferCreateInfo` or `VkImageCreateInfo`, e.g. 
- vmaCreateBuffer(), vmaCreateImage(), vmaFindMemoryTypeIndexForBufferInfo(), vmaFindMemoryTypeIndexForImageInfo() - and not with generic memory allocation functions. - */ - VMA_MEMORY_USAGE_AUTO = 7, - /** - Selects best memory type automatically with preference for GPU (device) memory. - - When using this flag, if you want to map the allocation (using vmaMapMemory() or #VMA_ALLOCATION_CREATE_MAPPED_BIT), - you must pass one of the flags: #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or #VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT - in VmaAllocationCreateInfo::flags. - - It can be used only with functions that let the library know `VkBufferCreateInfo` or `VkImageCreateInfo`, e.g. - vmaCreateBuffer(), vmaCreateImage(), vmaFindMemoryTypeIndexForBufferInfo(), vmaFindMemoryTypeIndexForImageInfo() - and not with generic memory allocation functions. - */ - VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE = 8, - /** - Selects best memory type automatically with preference for CPU (host) memory. - - When using this flag, if you want to map the allocation (using vmaMapMemory() or #VMA_ALLOCATION_CREATE_MAPPED_BIT), - you must pass one of the flags: #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or #VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT - in VmaAllocationCreateInfo::flags. - - It can be used only with functions that let the library know `VkBufferCreateInfo` or `VkImageCreateInfo`, e.g. - vmaCreateBuffer(), vmaCreateImage(), vmaFindMemoryTypeIndexForBufferInfo(), vmaFindMemoryTypeIndexForImageInfo() - and not with generic memory allocation functions. - */ - VMA_MEMORY_USAGE_AUTO_PREFER_HOST = 9, - - VMA_MEMORY_USAGE_MAX_ENUM = 0x7FFFFFFF -} VmaMemoryUsage; - -/// Flags to be passed as VmaAllocationCreateInfo::flags. -typedef enum VmaAllocationCreateFlagBits -{ - /** \brief Set this flag if the allocation should have its own memory block. - - Use it for special, big resources, like fullscreen images used as attachments. - */ - VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT = 0x00000001, - - /** \brief Set this flag to only try to allocate from existing `VkDeviceMemory` blocks and never create new such block. - - If new allocation cannot be placed in any of the existing blocks, allocation - fails with `VK_ERROR_OUT_OF_DEVICE_MEMORY` error. - - You should not use #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT and - #VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT at the same time. It makes no sense. - */ - VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT = 0x00000002, - /** \brief Set this flag to use a memory that will be persistently mapped and retrieve pointer to it. - - Pointer to mapped memory will be returned through VmaAllocationInfo::pMappedData. - - It is valid to use this flag for allocation made from memory type that is not - `HOST_VISIBLE`. This flag is then ignored and memory is not mapped. This is - useful if you need an allocation that is efficient to use on GPU - (`DEVICE_LOCAL`) and still want to map it directly if possible on platforms that - support it (e.g. Intel GPU). - */ - VMA_ALLOCATION_CREATE_MAPPED_BIT = 0x00000004, - /** \deprecated Preserved for backward compatibility. Consider using vmaSetAllocationName() instead. - - Set this flag to treat VmaAllocationCreateInfo::pUserData as pointer to a - null-terminated string. Instead of copying pointer value, a local copy of the - string is made and stored in allocation's `pName`. The string is automatically - freed together with the allocation. It is also used in vmaBuildStatsString(). 
- */ - VMA_ALLOCATION_CREATE_USER_DATA_COPY_STRING_BIT = 0x00000020, - /** Allocation will be created from upper stack in a double stack pool. - - This flag is only allowed for custom pools created with #VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT flag. - */ - VMA_ALLOCATION_CREATE_UPPER_ADDRESS_BIT = 0x00000040, - /** Create both buffer/image and allocation, but don't bind them together. - It is useful when you want to bind yourself to do some more advanced binding, e.g. using some extensions. - The flag is meaningful only with functions that bind by default: vmaCreateBuffer(), vmaCreateImage(). - Otherwise it is ignored. - - If you want to make sure the new buffer/image is not tied to the new memory allocation - through `VkMemoryDedicatedAllocateInfoKHR` structure in case the allocation ends up in its own memory block, - use also flag #VMA_ALLOCATION_CREATE_CAN_ALIAS_BIT. - */ - VMA_ALLOCATION_CREATE_DONT_BIND_BIT = 0x00000080, - /** Create allocation only if additional device memory required for it, if any, won't exceed - memory budget. Otherwise return `VK_ERROR_OUT_OF_DEVICE_MEMORY`. - */ - VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT = 0x00000100, - /** \brief Set this flag if the allocated memory will have aliasing resources. - - Usage of this flag prevents supplying `VkMemoryDedicatedAllocateInfoKHR` when #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT is specified. - Otherwise created dedicated memory will not be suitable for aliasing resources, resulting in Vulkan Validation Layer errors. - */ - VMA_ALLOCATION_CREATE_CAN_ALIAS_BIT = 0x00000200, - /** - Requests possibility to map the allocation (using vmaMapMemory() or #VMA_ALLOCATION_CREATE_MAPPED_BIT). - - - If you use #VMA_MEMORY_USAGE_AUTO or other `VMA_MEMORY_USAGE_AUTO*` value, - you must use this flag to be able to map the allocation. Otherwise, mapping is incorrect. - - If you use other value of #VmaMemoryUsage, this flag is ignored and mapping is always possible in memory types that are `HOST_VISIBLE`. - This includes allocations created in \ref custom_memory_pools. - - Declares that mapped memory will only be written sequentially, e.g. using `memcpy()` or a loop writing number-by-number, - never read or accessed randomly, so a memory type can be selected that is uncached and write-combined. - - \warning Violating this declaration may work correctly, but will likely be very slow. - Watch out for implicit reads introduced by doing e.g. `pMappedData[i] += x;` - Better prepare your data in a local variable and `memcpy()` it to the mapped pointer all at once. - */ - VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT = 0x00000400, - /** - Requests possibility to map the allocation (using vmaMapMemory() or #VMA_ALLOCATION_CREATE_MAPPED_BIT). - - - If you use #VMA_MEMORY_USAGE_AUTO or other `VMA_MEMORY_USAGE_AUTO*` value, - you must use this flag to be able to map the allocation. Otherwise, mapping is incorrect. - - If you use other value of #VmaMemoryUsage, this flag is ignored and mapping is always possible in memory types that are `HOST_VISIBLE`. - This includes allocations created in \ref custom_memory_pools. - - Declares that mapped memory can be read, written, and accessed in random order, - so a `HOST_CACHED` memory type is required. 
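As a concrete illustration of the two host-access flags described above (not taken from this patch), a typical VMA 3.x readback buffer combines `VMA_MEMORY_USAGE_AUTO` with `VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT` and `VMA_ALLOCATION_CREATE_MAPPED_BIT`; the buffer size and the pre-existing `allocator` handle are assumptions:

```
// Sketch only: `allocator` is an already-created VmaAllocator, and
// <vulkan/vulkan.h> plus "vk_mem_alloc.h" are assumed to be included.
VkBuffer create_readback_buffer(VmaAllocator allocator, VmaAllocation* out_alloc) {
  VkBufferCreateInfo buf_info = {VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO};
  buf_info.size = 65536; // illustrative size
  buf_info.usage = VK_BUFFER_USAGE_TRANSFER_DST_BIT;

  VmaAllocationCreateInfo alloc_info = {};
  alloc_info.usage = VMA_MEMORY_USAGE_AUTO;
  alloc_info.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT |
      VMA_ALLOCATION_CREATE_MAPPED_BIT;

  VkBuffer buffer = VK_NULL_HANDLE;
  VmaAllocationInfo allocation_info = {};
  const VkResult result = vmaCreateBuffer(
      allocator, &buf_info, &alloc_info, &buffer, out_alloc, &allocation_info);
  if (VK_SUCCESS != result) {
    return VK_NULL_HANDLE;
  }
  // HOST_ACCESS_RANDOM selects a HOST_VISIBLE (and HOST_CACHED) memory type,
  // and MAPPED_BIT keeps it persistently mapped, so
  // allocation_info.pMappedData can be read back directly.
  return buffer;
}
```
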
- */ - VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT = 0x00000800, - /** - Together with #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or #VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT, - it says that despite request for host access, a not-`HOST_VISIBLE` memory type can be selected - if it may improve performance. - - By using this flag, you declare that you will check if the allocation ended up in a `HOST_VISIBLE` memory type - (e.g. using vmaGetAllocationMemoryProperties()) and if not, you will create some "staging" buffer and - issue an explicit transfer to write/read your data. - To prepare for this possibility, don't forget to add appropriate flags like - `VK_BUFFER_USAGE_TRANSFER_DST_BIT`, `VK_BUFFER_USAGE_TRANSFER_SRC_BIT` to the parameters of created buffer or image. - */ - VMA_ALLOCATION_CREATE_HOST_ACCESS_ALLOW_TRANSFER_INSTEAD_BIT = 0x00001000, - /** Allocation strategy that chooses smallest possible free range for the allocation - to minimize memory usage and fragmentation, possibly at the expense of allocation time. - */ - VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT = 0x00010000, - /** Allocation strategy that chooses first suitable free range for the allocation - - not necessarily in terms of the smallest offset but the one that is easiest and fastest to find - to minimize allocation time, possibly at the expense of allocation quality. - */ - VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT = 0x00020000, - /** Allocation strategy that chooses always the lowest offset in available space. - This is not the most efficient strategy but achieves highly packed data. - Used internally by defragmentation, not recomended in typical usage. - */ - VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT = 0x00040000, - /** Alias to #VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT. - */ - VMA_ALLOCATION_CREATE_STRATEGY_BEST_FIT_BIT = VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT, - /** Alias to #VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT. - */ - VMA_ALLOCATION_CREATE_STRATEGY_FIRST_FIT_BIT = VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT, - /** A bit mask to extract only `STRATEGY` bits from entire set of flags. - */ - VMA_ALLOCATION_CREATE_STRATEGY_MASK = - VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT | - VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT | - VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT, - - VMA_ALLOCATION_CREATE_FLAG_BITS_MAX_ENUM = 0x7FFFFFFF -} VmaAllocationCreateFlagBits; -/// See #VmaAllocationCreateFlagBits. -typedef VkFlags VmaAllocationCreateFlags; - -/// Flags to be passed as VmaPoolCreateInfo::flags. -typedef enum VmaPoolCreateFlagBits -{ - /** \brief Use this flag if you always allocate only buffers and linear images or only optimal images out of this pool and so Buffer-Image Granularity can be ignored. - - This is an optional optimization flag. - - If you always allocate using vmaCreateBuffer(), vmaCreateImage(), - vmaAllocateMemoryForBuffer(), then you don't need to use it because allocator - knows exact type of your allocations so it can handle Buffer-Image Granularity - in the optimal way. - - If you also allocate using vmaAllocateMemoryForImage() or vmaAllocateMemory(), - exact type of such allocations is not known, so allocator must be conservative - in handling Buffer-Image Granularity, which can lead to suboptimal allocation - (wasted memory). 
In that case, if you can make sure you always allocate only - buffers and linear images or only optimal images out of this pool, use this flag - to make allocator disregard Buffer-Image Granularity and so make allocations - faster and more optimal. - */ - VMA_POOL_CREATE_IGNORE_BUFFER_IMAGE_GRANULARITY_BIT = 0x00000002, - - /** \brief Enables alternative, linear allocation algorithm in this pool. - - Specify this flag to enable linear allocation algorithm, which always creates - new allocations after last one and doesn't reuse space from allocations freed in - between. It trades memory consumption for simplified algorithm and data - structure, which has better performance and uses less memory for metadata. - - By using this flag, you can achieve behavior of free-at-once, stack, - ring buffer, and double stack. - For details, see documentation chapter \ref linear_algorithm. - */ - VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT = 0x00000004, - - /** Bit mask to extract only `ALGORITHM` bits from entire set of flags. - */ - VMA_POOL_CREATE_ALGORITHM_MASK = - VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT, - - VMA_POOL_CREATE_FLAG_BITS_MAX_ENUM = 0x7FFFFFFF -} VmaPoolCreateFlagBits; -/// Flags to be passed as VmaPoolCreateInfo::flags. See #VmaPoolCreateFlagBits. -typedef VkFlags VmaPoolCreateFlags; - -/// Flags to be passed as VmaDefragmentationInfo::flags. -typedef enum VmaDefragmentationFlagBits -{ - /* \brief Use simple but fast algorithm for defragmentation. - May not achieve best results but will require least time to compute and least allocations to copy. - */ - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FAST_BIT = 0x1, - /* \brief Default defragmentation algorithm, applied also when no `ALGORITHM` flag is specified. - Offers a balance between defragmentation quality and the amount of allocations and bytes that need to be moved. - */ - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_BALANCED_BIT = 0x2, - /* \brief Perform full defragmentation of memory. - Can result in notably more time to compute and allocations to copy, but will achieve best memory packing. - */ - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FULL_BIT = 0x4, - /** \brief Use the most roboust algorithm at the cost of time to compute and number of copies to make. - Only available when bufferImageGranularity is greater than 1, since it aims to reduce - alignment issues between different types of resources. - Otherwise falls back to same behavior as #VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FULL_BIT. - */ - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_EXTENSIVE_BIT = 0x8, - - /// A bit mask to extract only `ALGORITHM` bits from entire set of flags. - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_MASK = - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FAST_BIT | - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_BALANCED_BIT | - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FULL_BIT | - VMA_DEFRAGMENTATION_FLAG_ALGORITHM_EXTENSIVE_BIT, - - VMA_DEFRAGMENTATION_FLAG_BITS_MAX_ENUM = 0x7FFFFFFF -} VmaDefragmentationFlagBits; -/// See #VmaDefragmentationFlagBits. -typedef VkFlags VmaDefragmentationFlags; - -/// Operation performed on single defragmentation move. See structure #VmaDefragmentationMove. -typedef enum VmaDefragmentationMoveOperation -{ - /// Buffer/image has been recreated at `dstTmpAllocation`, data has been copied, old buffer/image has been destroyed. `srcAllocation` should be changed to point to the new place. This is the default value set by vmaBeginDefragmentationPass(). - VMA_DEFRAGMENTATION_MOVE_OPERATION_COPY = 0, - /// Set this value if you cannot move the allocation. 
New place reserved at `dstTmpAllocation` will be freed. `srcAllocation` will remain unchanged. - VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE = 1, - /// Set this value if you decide to abandon the allocation and you destroyed the buffer/image. New place reserved at `dstTmpAllocation` will be freed, along with `srcAllocation`, which will be destroyed. - VMA_DEFRAGMENTATION_MOVE_OPERATION_DESTROY = 2, -} VmaDefragmentationMoveOperation; - -/** @} */ - -/** -\addtogroup group_virtual -@{ -*/ - -/// Flags to be passed as VmaVirtualBlockCreateInfo::flags. -typedef enum VmaVirtualBlockCreateFlagBits -{ - /** \brief Enables alternative, linear allocation algorithm in this virtual block. - - Specify this flag to enable linear allocation algorithm, which always creates - new allocations after last one and doesn't reuse space from allocations freed in - between. It trades memory consumption for simplified algorithm and data - structure, which has better performance and uses less memory for metadata. - - By using this flag, you can achieve behavior of free-at-once, stack, - ring buffer, and double stack. - For details, see documentation chapter \ref linear_algorithm. - */ - VMA_VIRTUAL_BLOCK_CREATE_LINEAR_ALGORITHM_BIT = 0x00000001, - - /** \brief Bit mask to extract only `ALGORITHM` bits from entire set of flags. - */ - VMA_VIRTUAL_BLOCK_CREATE_ALGORITHM_MASK = - VMA_VIRTUAL_BLOCK_CREATE_LINEAR_ALGORITHM_BIT, - - VMA_VIRTUAL_BLOCK_CREATE_FLAG_BITS_MAX_ENUM = 0x7FFFFFFF -} VmaVirtualBlockCreateFlagBits; -/// Flags to be passed as VmaVirtualBlockCreateInfo::flags. See #VmaVirtualBlockCreateFlagBits. -typedef VkFlags VmaVirtualBlockCreateFlags; - -/// Flags to be passed as VmaVirtualAllocationCreateInfo::flags. -typedef enum VmaVirtualAllocationCreateFlagBits -{ - /** \brief Allocation will be created from upper stack in a double stack pool. - - This flag is only allowed for virtual blocks created with #VMA_VIRTUAL_BLOCK_CREATE_LINEAR_ALGORITHM_BIT flag. - */ - VMA_VIRTUAL_ALLOCATION_CREATE_UPPER_ADDRESS_BIT = VMA_ALLOCATION_CREATE_UPPER_ADDRESS_BIT, - /** \brief Allocation strategy that tries to minimize memory usage. - */ - VMA_VIRTUAL_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT = VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT, - /** \brief Allocation strategy that tries to minimize allocation time. - */ - VMA_VIRTUAL_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT = VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT, - /** Allocation strategy that chooses always the lowest offset in available space. - This is not the most efficient strategy but achieves highly packed data. - */ - VMA_VIRTUAL_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT = VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT, - /** \brief A bit mask to extract only `STRATEGY` bits from entire set of flags. - - These strategy flags are binary compatible with equivalent flags in #VmaAllocationCreateFlagBits. - */ - VMA_VIRTUAL_ALLOCATION_CREATE_STRATEGY_MASK = VMA_ALLOCATION_CREATE_STRATEGY_MASK, - - VMA_VIRTUAL_ALLOCATION_CREATE_FLAG_BITS_MAX_ENUM = 0x7FFFFFFF -} VmaVirtualAllocationCreateFlagBits; -/// Flags to be passed as VmaVirtualAllocationCreateInfo::flags. See #VmaVirtualAllocationCreateFlagBits. -typedef VkFlags VmaVirtualAllocationCreateFlags; - -/** @} */ - -#endif // _VMA_ENUM_DECLARATIONS - -#ifndef _VMA_DATA_TYPES_DECLARATIONS - -/** -\addtogroup group_init -@{ */ - -/** \struct VmaAllocator -\brief Represents main object of this library initialized. - -Fill structure #VmaAllocatorCreateInfo and call function vmaCreateAllocator() to create it. 
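A minimal allocator-creation sketch matching the description above (illustrative, not from this patch); the `instance`, `physical_device`, and `device` handles, the target Vulkan version, and the availability of `vk_mem_alloc.h` on the include path are assumptions:

```
// Sketch only: the Vulkan handles are created elsewhere.
#include "vk_mem_alloc.h"

VmaAllocator create_allocator(
    VkInstance instance, VkPhysicalDevice physical_device, VkDevice device) {
  VmaAllocatorCreateInfo create_info = {};
  create_info.vulkanApiVersion = VK_API_VERSION_1_1; // illustrative target
  create_info.instance = instance; // no longer optional as of VMA 3.0.0
  create_info.physicalDevice = physical_device;
  create_info.device = device;

  VmaAllocator allocator = VK_NULL_HANDLE;
  if (VK_SUCCESS != vmaCreateAllocator(&create_info, &allocator)) {
    return VK_NULL_HANDLE;
  }
  return allocator;
}
// At teardown, the matching call is vmaDestroyAllocator(allocator).
```
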
-Call function vmaDestroyAllocator() to destroy it. - -It is recommended to create just one object of this type per `VkDevice` object, -right after Vulkan is initialized and keep it alive until before Vulkan device is destroyed. -*/ -VK_DEFINE_HANDLE(VmaAllocator) - -/** @} */ - -/** -\addtogroup group_alloc -@{ -*/ - -/** \struct VmaPool -\brief Represents custom memory pool - -Fill structure VmaPoolCreateInfo and call function vmaCreatePool() to create it. -Call function vmaDestroyPool() to destroy it. - -For more information see [Custom memory pools](@ref choosing_memory_type_custom_memory_pools). -*/ -VK_DEFINE_HANDLE(VmaPool) - -/** \struct VmaAllocation -\brief Represents single memory allocation. - -It may be either dedicated block of `VkDeviceMemory` or a specific region of a bigger block of this type -plus unique offset. - -There are multiple ways to create such object. -You need to fill structure VmaAllocationCreateInfo. -For more information see [Choosing memory type](@ref choosing_memory_type). - -Although the library provides convenience functions that create Vulkan buffer or image, -allocate memory for it and bind them together, -binding of the allocation to a buffer or an image is out of scope of the allocation itself. -Allocation object can exist without buffer/image bound, -binding can be done manually by the user, and destruction of it can be done -independently of destruction of the allocation. - -The object also remembers its size and some other information. -To retrieve this information, use function vmaGetAllocationInfo() and inspect -returned structure VmaAllocationInfo. -*/ -VK_DEFINE_HANDLE(VmaAllocation) - -/** \struct VmaDefragmentationContext -\brief An opaque object that represents started defragmentation process. - -Fill structure #VmaDefragmentationInfo and call function vmaBeginDefragmentation() to create it. -Call function vmaEndDefragmentation() to destroy it. -*/ -VK_DEFINE_HANDLE(VmaDefragmentationContext) - -/** @} */ - -/** -\addtogroup group_virtual -@{ -*/ - -/** \struct VmaVirtualAllocation -\brief Represents single memory allocation done inside VmaVirtualBlock. - -Use it as a unique identifier to virtual allocation within the single block. - -Use value `VK_NULL_HANDLE` to represent a null/invalid allocation. -*/ -VK_DEFINE_NON_DISPATCHABLE_HANDLE(VmaVirtualAllocation); - -/** @} */ - -/** -\addtogroup group_virtual -@{ -*/ - -/** \struct VmaVirtualBlock -\brief Handle to a virtual block object that allows to use core allocation algorithm without allocating any real GPU memory. - -Fill in #VmaVirtualBlockCreateInfo structure and use vmaCreateVirtualBlock() to create it. Use vmaDestroyVirtualBlock() to destroy it. -For more information, see documentation chapter \ref virtual_allocator. - -This object is not thread-safe - should not be used from multiple threads simultaneously, must be synchronized externally. -*/ -VK_DEFINE_HANDLE(VmaVirtualBlock) - -/** @} */ - -/** -\addtogroup group_init -@{ -*/ - -/// Callback function called after successful vkAllocateMemory. -typedef void (VKAPI_PTR* PFN_vmaAllocateDeviceMemoryFunction)( - VmaAllocator VMA_NOT_NULL allocator, - uint32_t memoryType, - VkDeviceMemory VMA_NOT_NULL_NON_DISPATCHABLE memory, - VkDeviceSize size, - void* VMA_NULLABLE pUserData); - -/// Callback function called before vkFreeMemory. 
-typedef void (VKAPI_PTR* PFN_vmaFreeDeviceMemoryFunction)( - VmaAllocator VMA_NOT_NULL allocator, - uint32_t memoryType, - VkDeviceMemory VMA_NOT_NULL_NON_DISPATCHABLE memory, - VkDeviceSize size, - void* VMA_NULLABLE pUserData); - -/** \brief Set of callbacks that the library will call for `vkAllocateMemory` and `vkFreeMemory`. - -Provided for informative purpose, e.g. to gather statistics about number of -allocations or total amount of memory allocated in Vulkan. - -Used in VmaAllocatorCreateInfo::pDeviceMemoryCallbacks. -*/ -typedef struct VmaDeviceMemoryCallbacks -{ - /// Optional, can be null. - PFN_vmaAllocateDeviceMemoryFunction VMA_NULLABLE pfnAllocate; - /// Optional, can be null. - PFN_vmaFreeDeviceMemoryFunction VMA_NULLABLE pfnFree; - /// Optional, can be null. - void* VMA_NULLABLE pUserData; -} VmaDeviceMemoryCallbacks; - -/** \brief Pointers to some Vulkan functions - a subset used by the library. - -Used in VmaAllocatorCreateInfo::pVulkanFunctions. -*/ -typedef struct VmaVulkanFunctions -{ - /// Required when using VMA_DYNAMIC_VULKAN_FUNCTIONS. - PFN_vkGetInstanceProcAddr VMA_NULLABLE vkGetInstanceProcAddr; - /// Required when using VMA_DYNAMIC_VULKAN_FUNCTIONS. - PFN_vkGetDeviceProcAddr VMA_NULLABLE vkGetDeviceProcAddr; - PFN_vkGetPhysicalDeviceProperties VMA_NULLABLE vkGetPhysicalDeviceProperties; - PFN_vkGetPhysicalDeviceMemoryProperties VMA_NULLABLE vkGetPhysicalDeviceMemoryProperties; - PFN_vkAllocateMemory VMA_NULLABLE vkAllocateMemory; - PFN_vkFreeMemory VMA_NULLABLE vkFreeMemory; - PFN_vkMapMemory VMA_NULLABLE vkMapMemory; - PFN_vkUnmapMemory VMA_NULLABLE vkUnmapMemory; - PFN_vkFlushMappedMemoryRanges VMA_NULLABLE vkFlushMappedMemoryRanges; - PFN_vkInvalidateMappedMemoryRanges VMA_NULLABLE vkInvalidateMappedMemoryRanges; - PFN_vkBindBufferMemory VMA_NULLABLE vkBindBufferMemory; - PFN_vkBindImageMemory VMA_NULLABLE vkBindImageMemory; - PFN_vkGetBufferMemoryRequirements VMA_NULLABLE vkGetBufferMemoryRequirements; - PFN_vkGetImageMemoryRequirements VMA_NULLABLE vkGetImageMemoryRequirements; - PFN_vkCreateBuffer VMA_NULLABLE vkCreateBuffer; - PFN_vkDestroyBuffer VMA_NULLABLE vkDestroyBuffer; - PFN_vkCreateImage VMA_NULLABLE vkCreateImage; - PFN_vkDestroyImage VMA_NULLABLE vkDestroyImage; - PFN_vkCmdCopyBuffer VMA_NULLABLE vkCmdCopyBuffer; -#if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - /// Fetch "vkGetBufferMemoryRequirements2" on Vulkan >= 1.1, fetch "vkGetBufferMemoryRequirements2KHR" when using VK_KHR_dedicated_allocation extension. - PFN_vkGetBufferMemoryRequirements2KHR VMA_NULLABLE vkGetBufferMemoryRequirements2KHR; - /// Fetch "vkGetImageMemoryRequirements2" on Vulkan >= 1.1, fetch "vkGetImageMemoryRequirements2KHR" when using VK_KHR_dedicated_allocation extension. - PFN_vkGetImageMemoryRequirements2KHR VMA_NULLABLE vkGetImageMemoryRequirements2KHR; -#endif -#if VMA_BIND_MEMORY2 || VMA_VULKAN_VERSION >= 1001000 - /// Fetch "vkBindBufferMemory2" on Vulkan >= 1.1, fetch "vkBindBufferMemory2KHR" when using VK_KHR_bind_memory2 extension. - PFN_vkBindBufferMemory2KHR VMA_NULLABLE vkBindBufferMemory2KHR; - /// Fetch "vkBindImageMemory2" on Vulkan >= 1.1, fetch "vkBindImageMemory2KHR" when using VK_KHR_bind_memory2 extension. 
- PFN_vkBindImageMemory2KHR VMA_NULLABLE vkBindImageMemory2KHR; -#endif -#if VMA_MEMORY_BUDGET || VMA_VULKAN_VERSION >= 1001000 - PFN_vkGetPhysicalDeviceMemoryProperties2KHR VMA_NULLABLE vkGetPhysicalDeviceMemoryProperties2KHR; -#endif -#if VMA_VULKAN_VERSION >= 1003000 - /// Fetch from "vkGetDeviceBufferMemoryRequirements" on Vulkan >= 1.3, but you can also fetch it from "vkGetDeviceBufferMemoryRequirementsKHR" if you enabled extension VK_KHR_maintenance4. - PFN_vkGetDeviceBufferMemoryRequirements VMA_NULLABLE vkGetDeviceBufferMemoryRequirements; - /// Fetch from "vkGetDeviceImageMemoryRequirements" on Vulkan >= 1.3, but you can also fetch it from "vkGetDeviceImageMemoryRequirementsKHR" if you enabled extension VK_KHR_maintenance4. - PFN_vkGetDeviceImageMemoryRequirements VMA_NULLABLE vkGetDeviceImageMemoryRequirements; -#endif -} VmaVulkanFunctions; - -/// Description of a Allocator to be created. -typedef struct VmaAllocatorCreateInfo -{ - /// Flags for created allocator. Use #VmaAllocatorCreateFlagBits enum. - VmaAllocatorCreateFlags flags; - /// Vulkan physical device. - /** It must be valid throughout whole lifetime of created allocator. */ - VkPhysicalDevice VMA_NOT_NULL physicalDevice; - /// Vulkan device. - /** It must be valid throughout whole lifetime of created allocator. */ - VkDevice VMA_NOT_NULL device; - /// Preferred size of a single `VkDeviceMemory` block to be allocated from large heaps > 1 GiB. Optional. - /** Set to 0 to use default, which is currently 256 MiB. */ - VkDeviceSize preferredLargeHeapBlockSize; - /// Custom CPU memory allocation callbacks. Optional. - /** Optional, can be null. When specified, will also be used for all CPU-side memory allocations. */ - const VkAllocationCallbacks* VMA_NULLABLE pAllocationCallbacks; - /// Informative callbacks for `vkAllocateMemory`, `vkFreeMemory`. Optional. - /** Optional, can be null. */ - const VmaDeviceMemoryCallbacks* VMA_NULLABLE pDeviceMemoryCallbacks; - /** \brief Either null or a pointer to an array of limits on maximum number of bytes that can be allocated out of particular Vulkan memory heap. - - If not NULL, it must be a pointer to an array of - `VkPhysicalDeviceMemoryProperties::memoryHeapCount` elements, defining limit on - maximum number of bytes that can be allocated out of particular Vulkan memory - heap. - - Any of the elements may be equal to `VK_WHOLE_SIZE`, which means no limit on that - heap. This is also the default in case of `pHeapSizeLimit` = NULL. - - If there is a limit defined for a heap: - - - If user tries to allocate more memory from that heap using this allocator, - the allocation fails with `VK_ERROR_OUT_OF_DEVICE_MEMORY`. - - If the limit is smaller than heap size reported in `VkMemoryHeap::size`, the - value of this limit will be reported instead when using vmaGetMemoryProperties(). - - Warning! Using this feature may not be equivalent to installing a GPU with - smaller amount of memory, because graphics driver doesn't necessary fail new - allocations with `VK_ERROR_OUT_OF_DEVICE_MEMORY` result when memory capacity is - exceeded. It may return success and just silently migrate some device memory - blocks to system RAM. This driver behavior can also be controlled using - VK_AMD_memory_overallocation_behavior extension. - */ - const VkDeviceSize* VMA_NULLABLE VMA_LEN_IF_NOT_NULL("VkPhysicalDeviceMemoryProperties::memoryHeapCount") pHeapSizeLimit; - - /** \brief Pointers to Vulkan functions. Can be null. - - For details see [Pointers to Vulkan functions](@ref config_Vulkan_functions). 
- */ - const VmaVulkanFunctions* VMA_NULLABLE pVulkanFunctions; - /** \brief Handle to Vulkan instance object. - - Starting from version 3.0.0 this member is no longer optional, it must be set! - */ - VkInstance VMA_NOT_NULL instance; - /** \brief Optional. The highest version of Vulkan that the application is designed to use. - - It must be a value in the format as created by macro `VK_MAKE_VERSION` or a constant like: `VK_API_VERSION_1_1`, `VK_API_VERSION_1_0`. - The patch version number specified is ignored. Only the major and minor versions are considered. - It must be less or equal (preferably equal) to value as passed to `vkCreateInstance` as `VkApplicationInfo::apiVersion`. - Only versions 1.0, 1.1, 1.2, 1.3 are supported by the current implementation. - Leaving it initialized to zero is equivalent to `VK_API_VERSION_1_0`. - */ - uint32_t vulkanApiVersion; -#if VMA_EXTERNAL_MEMORY - /** \brief Either null or a pointer to an array of external memory handle types for each Vulkan memory type. - - If not NULL, it must be a pointer to an array of `VkPhysicalDeviceMemoryProperties::memoryTypeCount` - elements, defining external memory handle types of particular Vulkan memory type, - to be passed using `VkExportMemoryAllocateInfoKHR`. - - Any of the elements may be equal to 0, which means not to use `VkExportMemoryAllocateInfoKHR` on this memory type. - This is also the default in case of `pTypeExternalMemoryHandleTypes` = NULL. - */ - const VkExternalMemoryHandleTypeFlagsKHR* VMA_NULLABLE VMA_LEN_IF_NOT_NULL("VkPhysicalDeviceMemoryProperties::memoryTypeCount") pTypeExternalMemoryHandleTypes; -#endif // #if VMA_EXTERNAL_MEMORY -} VmaAllocatorCreateInfo; - -/// Information about existing #VmaAllocator object. -typedef struct VmaAllocatorInfo -{ - /** \brief Handle to Vulkan instance object. - - This is the same value as has been passed through VmaAllocatorCreateInfo::instance. - */ - VkInstance VMA_NOT_NULL instance; - /** \brief Handle to Vulkan physical device object. - - This is the same value as has been passed through VmaAllocatorCreateInfo::physicalDevice. - */ - VkPhysicalDevice VMA_NOT_NULL physicalDevice; - /** \brief Handle to Vulkan device object. - - This is the same value as has been passed through VmaAllocatorCreateInfo::device. - */ - VkDevice VMA_NOT_NULL device; -} VmaAllocatorInfo; - -/** @} */ - -/** -\addtogroup group_stats -@{ -*/ - -/** \brief Calculated statistics of memory usage e.g. in a specific memory type, heap, custom pool, or total. - -These are fast to calculate. -See functions: vmaGetHeapBudgets(), vmaGetPoolStatistics(). -*/ -typedef struct VmaStatistics -{ - /** \brief Number of `VkDeviceMemory` objects - Vulkan memory blocks allocated. - */ - uint32_t blockCount; - /** \brief Number of #VmaAllocation objects allocated. - - Dedicated allocations have their own blocks, so each one adds 1 to `allocationCount` as well as `blockCount`. - */ - uint32_t allocationCount; - /** \brief Number of bytes allocated in `VkDeviceMemory` blocks. - - \note To avoid confusion, please be aware that what Vulkan calls an "allocation" - a whole `VkDeviceMemory` object - (e.g. as in `VkPhysicalDeviceLimits::maxMemoryAllocationCount`) is called a "block" in VMA, while VMA calls - "allocation" a #VmaAllocation object that represents a memory region sub-allocated from such block, usually for a single buffer or image. - */ - VkDeviceSize blockBytes; - /** \brief Total number of bytes occupied by all #VmaAllocation objects. - - Always less or equal than `blockBytes`. 
- Difference `(blockBytes - allocationBytes)` is the amount of memory allocated from Vulkan - but unused by any #VmaAllocation. - */ - VkDeviceSize allocationBytes; -} VmaStatistics; - -/** \brief More detailed statistics than #VmaStatistics. - -These are slower to calculate. Use for debugging purposes. -See functions: vmaCalculateStatistics(), vmaCalculatePoolStatistics(). - -Previous version of the statistics API provided averages, but they have been removed -because they can be easily calculated as: - -\code -VkDeviceSize allocationSizeAvg = detailedStats.statistics.allocationBytes / detailedStats.statistics.allocationCount; -VkDeviceSize unusedBytes = detailedStats.statistics.blockBytes - detailedStats.statistics.allocationBytes; -VkDeviceSize unusedRangeSizeAvg = unusedBytes / detailedStats.unusedRangeCount; -\endcode -*/ -typedef struct VmaDetailedStatistics -{ - /// Basic statistics. - VmaStatistics statistics; - /// Number of free ranges of memory between allocations. - uint32_t unusedRangeCount; - /// Smallest allocation size. `VK_WHOLE_SIZE` if there are 0 allocations. - VkDeviceSize allocationSizeMin; - /// Largest allocation size. 0 if there are 0 allocations. - VkDeviceSize allocationSizeMax; - /// Smallest empty range size. `VK_WHOLE_SIZE` if there are 0 empty ranges. - VkDeviceSize unusedRangeSizeMin; - /// Largest empty range size. 0 if there are 0 empty ranges. - VkDeviceSize unusedRangeSizeMax; -} VmaDetailedStatistics; - -/** \brief General statistics from current state of the Allocator - -total memory usage across all memory heaps and types. - -These are slower to calculate. Use for debugging purposes. -See function vmaCalculateStatistics(). -*/ -typedef struct VmaTotalStatistics -{ - VmaDetailedStatistics memoryType[VK_MAX_MEMORY_TYPES]; - VmaDetailedStatistics memoryHeap[VK_MAX_MEMORY_HEAPS]; - VmaDetailedStatistics total; -} VmaTotalStatistics; - -/** \brief Statistics of current memory usage and available budget for a specific memory heap. - -These are fast to calculate. -See function vmaGetHeapBudgets(). -*/ -typedef struct VmaBudget -{ - /** \brief Statistics fetched from the library. - */ - VmaStatistics statistics; - /** \brief Estimated current memory usage of the program, in bytes. - - Fetched from system using VK_EXT_memory_budget extension if enabled. - - It might be different than `statistics.blockBytes` (usually higher) due to additional implicit objects - also occupying the memory, like swapchain, pipelines, descriptor heaps, command buffers, or - `VkDeviceMemory` blocks allocated outside of this library, if any. - */ - VkDeviceSize usage; - /** \brief Estimated amount of memory available to the program, in bytes. - - Fetched from system using VK_EXT_memory_budget extension if enabled. - - It might be different (most probably smaller) than `VkMemoryHeap::size[heapIndex]` due to factors - external to the program, decided by the operating system. - Difference `budget - usage` is the amount of additional memory that can probably - be allocated without problems. Exceeding the budget may result in various problems. - */ - VkDeviceSize budget; -} VmaBudget; - -/** @} */ - -/** -\addtogroup group_alloc -@{ -*/ - -/** \brief Parameters of new #VmaAllocation. - -To be used with functions like vmaCreateBuffer(), vmaCreateImage(), and many others. -*/ -typedef struct VmaAllocationCreateInfo -{ - /// Use #VmaAllocationCreateFlagBits enum. - VmaAllocationCreateFlags flags; - /** \brief Intended usage of memory. 
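As a usage sketch for the VmaBudget structure above, the snippet below queries per-heap usage and budget through vmaGetHeapBudgets(), which is declared further down in this header. It assumes a valid `allocator` and the standard `<vector>` and `<cstdio>` headers.

```
// Sketch: compare current usage against the budget of every memory heap.
const VkPhysicalDeviceMemoryProperties* memProps = nullptr;
vmaGetMemoryProperties(allocator, &memProps);

std::vector<VmaBudget> budgets(memProps->memoryHeapCount);
vmaGetHeapBudgets(allocator, budgets.data());

for (uint32_t heap = 0; heap < memProps->memoryHeapCount; ++heap)
{
    std::printf("heap %u: %llu used of %llu budgeted bytes, %u blocks, %u allocations\n",
        heap,
        static_cast<unsigned long long>(budgets[heap].usage),
        static_cast<unsigned long long>(budgets[heap].budget),
        budgets[heap].statistics.blockCount,
        budgets[heap].statistics.allocationCount);
}
```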
- - You can leave #VMA_MEMORY_USAGE_UNKNOWN if you specify memory requirements in other way. \n - If `pool` is not null, this member is ignored. - */ - VmaMemoryUsage usage; - /** \brief Flags that must be set in a Memory Type chosen for an allocation. - - Leave 0 if you specify memory requirements in other way. \n - If `pool` is not null, this member is ignored.*/ - VkMemoryPropertyFlags requiredFlags; - /** \brief Flags that preferably should be set in a memory type chosen for an allocation. - - Set to 0 if no additional flags are preferred. \n - If `pool` is not null, this member is ignored. */ - VkMemoryPropertyFlags preferredFlags; - /** \brief Bitmask containing one bit set for every memory type acceptable for this allocation. - - Value 0 is equivalent to `UINT32_MAX` - it means any memory type is accepted if - it meets other requirements specified by this structure, with no further - restrictions on memory type index. \n - If `pool` is not null, this member is ignored. - */ - uint32_t memoryTypeBits; - /** \brief Pool that this allocation should be created in. - - Leave `VK_NULL_HANDLE` to allocate from default pool. If not null, members: - `usage`, `requiredFlags`, `preferredFlags`, `memoryTypeBits` are ignored. - */ - VmaPool VMA_NULLABLE pool; - /** \brief Custom general-purpose pointer that will be stored in #VmaAllocation, can be read as VmaAllocationInfo::pUserData and changed using vmaSetAllocationUserData(). - - If #VMA_ALLOCATION_CREATE_USER_DATA_COPY_STRING_BIT is used, it must be either - null or pointer to a null-terminated string. The string will be then copied to - internal buffer, so it doesn't need to be valid after allocation call. - */ - void* VMA_NULLABLE pUserData; - /** \brief A floating-point value between 0 and 1, indicating the priority of the allocation relative to other memory allocations. - - It is used only when #VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT flag was used during creation of the #VmaAllocator object - and this allocation ends up as dedicated or is explicitly forced as dedicated using #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT. - Otherwise, it has the priority of a memory block where it is placed and this variable is ignored. - */ - float priority; -} VmaAllocationCreateInfo; - -/// Describes parameter of created #VmaPool. -typedef struct VmaPoolCreateInfo -{ - /** \brief Vulkan memory type index to allocate this pool from. - */ - uint32_t memoryTypeIndex; - /** \brief Use combination of #VmaPoolCreateFlagBits. - */ - VmaPoolCreateFlags flags; - /** \brief Size of a single `VkDeviceMemory` block to be allocated as part of this pool, in bytes. Optional. - - Specify nonzero to set explicit, constant size of memory blocks used by this - pool. - - Leave 0 to use default and let the library manage block sizes automatically. - Sizes of particular blocks may vary. - In this case, the pool will also support dedicated allocations. - */ - VkDeviceSize blockSize; - /** \brief Minimum number of blocks to be always allocated in this pool, even if they stay empty. - - Set to 0 to have no preallocated blocks and allow the pool be completely empty. - */ - size_t minBlockCount; - /** \brief Maximum number of blocks that can be allocated in this pool. Optional. - - Set to 0 to use default, which is `SIZE_MAX`, which means no limit. - - Set to same value as VmaPoolCreateInfo::minBlockCount to have fixed amount of memory allocated - throughout whole lifetime of this pool. 
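Two common ways of filling VmaAllocationCreateInfo are sketched below: a GPU-only resource and a persistently mapped upload buffer. The sketch assumes a valid `allocator`, already filled VkBufferCreateInfo structures `vbCreateInfo` and `stagingBufCreateInfo` (hypothetical names), and the VMA_MEMORY_USAGE_AUTO* and host-access flags declared earlier in this header; vmaCreateBuffer() itself is declared further below.

```
// Sketch 1: device-local vertex buffer.
VmaAllocationCreateInfo gpuOnly = {};
gpuOnly.usage = VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE;
gpuOnly.priority = 1.0f; // honored only with VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT

VkBuffer vertexBuffer = VK_NULL_HANDLE;
VmaAllocation vertexAlloc = VK_NULL_HANDLE;
vmaCreateBuffer(allocator, &vbCreateInfo, &gpuOnly, &vertexBuffer, &vertexAlloc, nullptr);

// Sketch 2: persistently mapped staging buffer for sequential CPU writes.
VmaAllocationCreateInfo staging = {};
staging.usage = VMA_MEMORY_USAGE_AUTO;
staging.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
                VMA_ALLOCATION_CREATE_MAPPED_BIT;

VkBuffer stagingBuffer = VK_NULL_HANDLE;
VmaAllocation stagingAlloc = VK_NULL_HANDLE;
VmaAllocationInfo stagingAllocInfo = {};
vmaCreateBuffer(allocator, &stagingBufCreateInfo, &staging,
                &stagingBuffer, &stagingAlloc, &stagingAllocInfo);
// stagingAllocInfo.pMappedData stays valid for the allocation's whole lifetime.
```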
- */ - size_t maxBlockCount; - /** \brief A floating-point value between 0 and 1, indicating the priority of the allocations in this pool relative to other memory allocations. - - It is used only when #VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT flag was used during creation of the #VmaAllocator object. - Otherwise, this variable is ignored. - */ - float priority; - /** \brief Additional minimum alignment to be used for all allocations created from this pool. Can be 0. - - Leave 0 (default) not to impose any additional alignment. If not 0, it must be a power of two. - It can be useful in cases where alignment returned by Vulkan by functions like `vkGetBufferMemoryRequirements` is not enough, - e.g. when doing interop with OpenGL. - */ - VkDeviceSize minAllocationAlignment; - /** \brief Additional `pNext` chain to be attached to `VkMemoryAllocateInfo` used for every allocation made by this pool. Optional. - - Optional, can be null. If not null, it must point to a `pNext` chain of structures that can be attached to `VkMemoryAllocateInfo`. - It can be useful for special needs such as adding `VkExportMemoryAllocateInfoKHR`. - Structures pointed by this member must remain alive and unchanged for the whole lifetime of the custom pool. - - Please note that some structures, e.g. `VkMemoryPriorityAllocateInfoEXT`, `VkMemoryDedicatedAllocateInfoKHR`, - can be attached automatically by this library when using other, more convenient of its features. - */ - void* VMA_NULLABLE pMemoryAllocateNext; -} VmaPoolCreateInfo; - -/** @} */ - -/** -\addtogroup group_alloc -@{ -*/ - -/// Parameters of #VmaAllocation objects, that can be retrieved using function vmaGetAllocationInfo(). -typedef struct VmaAllocationInfo -{ - /** \brief Memory type index that this allocation was allocated from. - - It never changes. - */ - uint32_t memoryType; - /** \brief Handle to Vulkan memory object. - - Same memory object can be shared by multiple allocations. - - It can change after the allocation is moved during \ref defragmentation. - */ - VkDeviceMemory VMA_NULLABLE_NON_DISPATCHABLE deviceMemory; - /** \brief Offset in `VkDeviceMemory` object to the beginning of this allocation, in bytes. `(deviceMemory, offset)` pair is unique to this allocation. - - You usually don't need to use this offset. If you create a buffer or an image together with the allocation using e.g. function - vmaCreateBuffer(), vmaCreateImage(), functions that operate on these resources refer to the beginning of the buffer or image, - not entire device memory block. Functions like vmaMapMemory(), vmaBindBufferMemory() also refer to the beginning of the allocation - and apply this offset automatically. - - It can change after the allocation is moved during \ref defragmentation. - */ - VkDeviceSize offset; - /** \brief Size of this allocation, in bytes. - - It never changes. - - \note Allocation size returned in this variable may be greater than the size - requested for the resource e.g. as `VkBufferCreateInfo::size`. Whole size of the - allocation is accessible for operations on memory e.g. using a pointer after - mapping with vmaMapMemory(), but operations on the resource e.g. using - `vkCmdCopyBuffer` must be limited to the size of the resource. - */ - VkDeviceSize size; - /** \brief Pointer to the beginning of this allocation as mapped data. - - If the allocation hasn't been mapped using vmaMapMemory() and hasn't been - created with #VMA_ALLOCATION_CREATE_MAPPED_BIT flag, this value is null. 
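The VmaAllocationInfo fields described above can be read back at any time with vmaGetAllocationInfo(). A short sketch, assuming a valid `allocator`, an existing allocation `alloc` created earlier, and `<cstring>` for `std::memset`:

```
// Sketch: inspect where an allocation ended up.
VmaAllocationInfo info = {};
vmaGetAllocationInfo(allocator, alloc, &info);

// info.deviceMemory + info.offset identify the sub-allocated region; the offset is
// applied automatically by vmaMapMemory()/vmaBindBufferMemory(), so it is rarely needed directly.
// info.size may be larger than the size requested for the resource.
if (info.pMappedData != nullptr)
{
    // Mapped persistently (VMA_ALLOCATION_CREATE_MAPPED_BIT) or currently mapped.
    std::memset(info.pMappedData, 0, static_cast<size_t>(info.size));
}
```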
- - It can change after call to vmaMapMemory(), vmaUnmapMemory(). - It can also change after the allocation is moved during \ref defragmentation. - */ - void* VMA_NULLABLE pMappedData; - /** \brief Custom general-purpose pointer that was passed as VmaAllocationCreateInfo::pUserData or set using vmaSetAllocationUserData(). - - It can change after call to vmaSetAllocationUserData() for this allocation. - */ - void* VMA_NULLABLE pUserData; - /** \brief Custom allocation name that was set with vmaSetAllocationName(). - - It can change after call to vmaSetAllocationName() for this allocation. - - Another way to set custom name is to pass it in VmaAllocationCreateInfo::pUserData with - additional flag #VMA_ALLOCATION_CREATE_USER_DATA_COPY_STRING_BIT set [DEPRECATED]. - */ - const char* VMA_NULLABLE pName; -} VmaAllocationInfo; - -/** \brief Parameters for defragmentation. - -To be used with function vmaBeginDefragmentation(). -*/ -typedef struct VmaDefragmentationInfo -{ - /// \brief Use combination of #VmaDefragmentationFlagBits. - VmaDefragmentationFlags flags; - /** \brief Custom pool to be defragmented. - - If null then default pools will undergo defragmentation process. - */ - VmaPool VMA_NULLABLE pool; - /** \brief Maximum numbers of bytes that can be copied during single pass, while moving allocations to different places. - - `0` means no limit. - */ - VkDeviceSize maxBytesPerPass; - /** \brief Maximum number of allocations that can be moved during single pass to a different place. - - `0` means no limit. - */ - uint32_t maxAllocationsPerPass; -} VmaDefragmentationInfo; - -/// Single move of an allocation to be done for defragmentation. -typedef struct VmaDefragmentationMove -{ - /// Operation to be performed on the allocation by vmaEndDefragmentationPass(). Default value is #VMA_DEFRAGMENTATION_MOVE_OPERATION_COPY. You can modify it. - VmaDefragmentationMoveOperation operation; - /// Allocation that should be moved. - VmaAllocation VMA_NOT_NULL srcAllocation; - /** \brief Temporary allocation pointing to destination memory that will replace `srcAllocation`. - - \warning Do not store this allocation in your data structures! It exists only temporarily, for the duration of the defragmentation pass, - to be used for binding new buffer/image to the destination memory using e.g. vmaBindBufferMemory(). - vmaEndDefragmentationPass() will destroy it and make `srcAllocation` point to this memory. - */ - VmaAllocation VMA_NOT_NULL dstTmpAllocation; -} VmaDefragmentationMove; - -/** \brief Parameters for incremental defragmentation steps. - -To be used with function vmaBeginDefragmentationPass(). -*/ -typedef struct VmaDefragmentationPassMoveInfo -{ - /// Number of elements in the `pMoves` array. - uint32_t moveCount; - /** \brief Array of moves to be performed by the user in the current defragmentation pass. - - Pointer to an array of `moveCount` elements, owned by VMA, created in vmaBeginDefragmentationPass(), destroyed in vmaEndDefragmentationPass(). - - For each element, you should: - - 1. Create a new buffer/image in the place pointed by VmaDefragmentationMove::dstMemory + VmaDefragmentationMove::dstOffset. - 2. Copy data from the VmaDefragmentationMove::srcAllocation e.g. using `vkCmdCopyBuffer`, `vkCmdCopyImage`. - 3. Make sure these commands finished executing on the GPU. - 4. Destroy the old buffer/image. - - Only then you can finish defragmentation pass by calling vmaEndDefragmentationPass(). - After this call, the allocation will point to the new place in memory. 
- - Alternatively, if you cannot move specific allocation, you can set VmaDefragmentationMove::operation to #VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE. - - Alternatively, if you decide you want to completely remove the allocation: - - 1. Destroy its buffer/image. - 2. Set VmaDefragmentationMove::operation to #VMA_DEFRAGMENTATION_MOVE_OPERATION_DESTROY. - - Then, after vmaEndDefragmentationPass() the allocation will be freed. - */ - VmaDefragmentationMove* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(moveCount) pMoves; -} VmaDefragmentationPassMoveInfo; - -/// Statistics returned for defragmentation process in function vmaEndDefragmentation(). -typedef struct VmaDefragmentationStats -{ - /// Total number of bytes that have been copied while moving allocations to different places. - VkDeviceSize bytesMoved; - /// Total number of bytes that have been released to the system by freeing empty `VkDeviceMemory` objects. - VkDeviceSize bytesFreed; - /// Number of allocations that have been moved to different places. - uint32_t allocationsMoved; - /// Number of empty `VkDeviceMemory` objects that have been released to the system. - uint32_t deviceMemoryBlocksFreed; -} VmaDefragmentationStats; - -/** @} */ - -/** -\addtogroup group_virtual -@{ -*/ - -/// Parameters of created #VmaVirtualBlock object to be passed to vmaCreateVirtualBlock(). -typedef struct VmaVirtualBlockCreateInfo -{ - /** \brief Total size of the virtual block. - - Sizes can be expressed in bytes or any units you want as long as you are consistent in using them. - For example, if you allocate from some array of structures, 1 can mean single instance of entire structure. - */ - VkDeviceSize size; - - /** \brief Use combination of #VmaVirtualBlockCreateFlagBits. - */ - VmaVirtualBlockCreateFlags flags; - - /** \brief Custom CPU memory allocation callbacks. Optional. - - Optional, can be null. When specified, they will be used for all CPU-side memory allocations. - */ - const VkAllocationCallbacks* VMA_NULLABLE pAllocationCallbacks; -} VmaVirtualBlockCreateInfo; - -/// Parameters of created virtual allocation to be passed to vmaVirtualAllocate(). -typedef struct VmaVirtualAllocationCreateInfo -{ - /** \brief Size of the allocation. - - Cannot be zero. - */ - VkDeviceSize size; - /** \brief Required alignment of the allocation. Optional. - - Must be power of two. Special value 0 has the same meaning as 1 - means no special alignment is required, so allocation can start at any offset. - */ - VkDeviceSize alignment; - /** \brief Use combination of #VmaVirtualAllocationCreateFlagBits. - */ - VmaVirtualAllocationCreateFlags flags; - /** \brief Custom pointer to be associated with the allocation. Optional. - - It can be any value and can be used for user-defined purposes. It can be fetched or changed later. - */ - void* VMA_NULLABLE pUserData; -} VmaVirtualAllocationCreateInfo; - -/// Parameters of an existing virtual allocation, returned by vmaGetVirtualAllocationInfo(). -typedef struct VmaVirtualAllocationInfo -{ - /** \brief Offset of the allocation. - - Offset at which the allocation was made. - */ - VkDeviceSize offset; - /** \brief Size of the allocation. - - Same value as passed in VmaVirtualAllocationCreateInfo::size. - */ - VkDeviceSize size; - /** \brief Custom pointer associated with the allocation. - - Same value as passed in VmaVirtualAllocationCreateInfo::pUserData or to vmaSetVirtualAllocationUserData(). 
- */ - void* VMA_NULLABLE pUserData; -} VmaVirtualAllocationInfo; - -/** @} */ - -#endif // _VMA_DATA_TYPES_DECLARATIONS - -#ifndef _VMA_FUNCTION_HEADERS - -/** -\addtogroup group_init -@{ -*/ - -/// Creates #VmaAllocator object. -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateAllocator( - const VmaAllocatorCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaAllocator VMA_NULLABLE* VMA_NOT_NULL pAllocator); - -/// Destroys allocator object. -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyAllocator( - VmaAllocator VMA_NULLABLE allocator); - -/** \brief Returns information about existing #VmaAllocator object - handle to Vulkan device etc. - -It might be useful if you want to keep just the #VmaAllocator handle and fetch other required handles to -`VkPhysicalDevice`, `VkDevice` etc. every time using this function. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetAllocatorInfo( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocatorInfo* VMA_NOT_NULL pAllocatorInfo); - -/** -PhysicalDeviceProperties are fetched from physicalDevice by the allocator. -You can access it here, without fetching it again on your own. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetPhysicalDeviceProperties( - VmaAllocator VMA_NOT_NULL allocator, - const VkPhysicalDeviceProperties* VMA_NULLABLE* VMA_NOT_NULL ppPhysicalDeviceProperties); - -/** -PhysicalDeviceMemoryProperties are fetched from physicalDevice by the allocator. -You can access it here, without fetching it again on your own. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetMemoryProperties( - VmaAllocator VMA_NOT_NULL allocator, - const VkPhysicalDeviceMemoryProperties* VMA_NULLABLE* VMA_NOT_NULL ppPhysicalDeviceMemoryProperties); - -/** -\brief Given Memory Type Index, returns Property Flags of this memory type. - -This is just a convenience function. Same information can be obtained using -vmaGetMemoryProperties(). -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetMemoryTypeProperties( - VmaAllocator VMA_NOT_NULL allocator, - uint32_t memoryTypeIndex, - VkMemoryPropertyFlags* VMA_NOT_NULL pFlags); - -/** \brief Sets index of the current frame. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaSetCurrentFrameIndex( - VmaAllocator VMA_NOT_NULL allocator, - uint32_t frameIndex); - -/** @} */ - -/** -\addtogroup group_stats -@{ -*/ - -/** \brief Retrieves statistics from current state of the Allocator. - -This function is called "calculate" not "get" because it has to traverse all -internal data structures, so it may be quite slow. Use it for debugging purposes. -For faster but more brief statistics suitable to be called every frame or every allocation, -use vmaGetHeapBudgets(). - -Note that when using allocator from multiple threads, returned information may immediately -become outdated. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaCalculateStatistics( - VmaAllocator VMA_NOT_NULL allocator, - VmaTotalStatistics* VMA_NOT_NULL pStats); - -/** \brief Retrieves information about current memory usage and budget for all memory heaps. - -\param allocator -\param[out] pBudgets Must point to array with number of elements at least equal to number of memory heaps in physical device used. - -This function is called "get" not "calculate" because it is very fast, suitable to be called -every frame or every allocation. For more detailed statistics use vmaCalculateStatistics(). - -Note that when using allocator from multiple threads, returned information may immediately -become outdated. 
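A sketch of the debugging-oriented entry points above: bump the frame index once per frame (it feeds the budgeting logic) and, when needed, dump full statistics with the slow vmaCalculateStatistics() call. A valid `allocator` and `<cstdio>` are assumed.

```
// Once per frame, before allocating for that frame:
static uint32_t frameIndex = 0;
vmaSetCurrentFrameIndex(allocator, ++frameIndex);

// Occasionally (this walks all internal structures, so keep it out of hot paths):
VmaTotalStatistics stats = {};
vmaCalculateStatistics(allocator, &stats);
std::printf("total: %u blocks, %u allocations, %llu bytes allocated, %llu bytes unused\n",
    stats.total.statistics.blockCount,
    stats.total.statistics.allocationCount,
    static_cast<unsigned long long>(stats.total.statistics.allocationBytes),
    static_cast<unsigned long long>(stats.total.statistics.blockBytes -
                                    stats.total.statistics.allocationBytes));
```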
-*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetHeapBudgets( - VmaAllocator VMA_NOT_NULL allocator, - VmaBudget* VMA_NOT_NULL VMA_LEN_IF_NOT_NULL("VkPhysicalDeviceMemoryProperties::memoryHeapCount") pBudgets); - -/** @} */ - -/** -\addtogroup group_alloc -@{ -*/ - -/** -\brief Helps to find memoryTypeIndex, given memoryTypeBits and VmaAllocationCreateInfo. - -This algorithm tries to find a memory type that: - -- Is allowed by memoryTypeBits. -- Contains all the flags from pAllocationCreateInfo->requiredFlags. -- Matches intended usage. -- Has as many flags from pAllocationCreateInfo->preferredFlags as possible. - -\return Returns VK_ERROR_FEATURE_NOT_PRESENT if not found. Receiving such result -from this function or any other allocating function probably means that your -device doesn't support any memory type with requested features for the specific -type of resource you want to use it for. Please check parameters of your -resource, like image layout (OPTIMAL versus LINEAR) or mip level count. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFindMemoryTypeIndex( - VmaAllocator VMA_NOT_NULL allocator, - uint32_t memoryTypeBits, - const VmaAllocationCreateInfo* VMA_NOT_NULL pAllocationCreateInfo, - uint32_t* VMA_NOT_NULL pMemoryTypeIndex); - -/** -\brief Helps to find memoryTypeIndex, given VkBufferCreateInfo and VmaAllocationCreateInfo. - -It can be useful e.g. to determine value to be used as VmaPoolCreateInfo::memoryTypeIndex. -It internally creates a temporary, dummy buffer that never has memory bound. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFindMemoryTypeIndexForBufferInfo( - VmaAllocator VMA_NOT_NULL allocator, - const VkBufferCreateInfo* VMA_NOT_NULL pBufferCreateInfo, - const VmaAllocationCreateInfo* VMA_NOT_NULL pAllocationCreateInfo, - uint32_t* VMA_NOT_NULL pMemoryTypeIndex); - -/** -\brief Helps to find memoryTypeIndex, given VkImageCreateInfo and VmaAllocationCreateInfo. - -It can be useful e.g. to determine value to be used as VmaPoolCreateInfo::memoryTypeIndex. -It internally creates a temporary, dummy image that never has memory bound. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFindMemoryTypeIndexForImageInfo( - VmaAllocator VMA_NOT_NULL allocator, - const VkImageCreateInfo* VMA_NOT_NULL pImageCreateInfo, - const VmaAllocationCreateInfo* VMA_NOT_NULL pAllocationCreateInfo, - uint32_t* VMA_NOT_NULL pMemoryTypeIndex); - -/** \brief Allocates Vulkan device memory and creates #VmaPool object. - -\param allocator Allocator object. -\param pCreateInfo Parameters of pool to create. -\param[out] pPool Handle to created pool. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreatePool( - VmaAllocator VMA_NOT_NULL allocator, - const VmaPoolCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaPool VMA_NULLABLE* VMA_NOT_NULL pPool); - -/** \brief Destroys #VmaPool object and frees Vulkan device memory. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyPool( - VmaAllocator VMA_NOT_NULL allocator, - VmaPool VMA_NULLABLE pool); - -/** @} */ - -/** -\addtogroup group_stats -@{ -*/ - -/** \brief Retrieves statistics of existing #VmaPool object. - -\param allocator Allocator object. -\param pool Pool object. -\param[out] pPoolStats Statistics of specified pool. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetPoolStatistics( - VmaAllocator VMA_NOT_NULL allocator, - VmaPool VMA_NOT_NULL pool, - VmaStatistics* VMA_NOT_NULL pPoolStats); - -/** \brief Retrieves detailed statistics of existing #VmaPool object. - -\param allocator Allocator object. -\param pool Pool object. -\param[out] pPoolStats Statistics of specified pool. 
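The find-index helpers above are typically used to seed VmaPoolCreateInfo::memoryTypeIndex. Below is a hedged sketch that picks a memory type for a representative buffer description, creates a custom pool, queries it, and destroys it; a valid `allocator` and a filled `exampleBufCreateInfo` (hypothetical name) are assumed.

```
// Sketch: create a custom pool for buffers matching exampleBufCreateInfo.
VmaAllocationCreateInfo sampleAllocCreateInfo = {};
sampleAllocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;

uint32_t memTypeIndex = 0;
VkResult res = vmaFindMemoryTypeIndexForBufferInfo(
    allocator, &exampleBufCreateInfo, &sampleAllocCreateInfo, &memTypeIndex);
// Check res != VK_ERROR_FEATURE_NOT_PRESENT before continuing.

VmaPoolCreateInfo poolCreateInfo = {};
poolCreateInfo.memoryTypeIndex = memTypeIndex;
poolCreateInfo.minBlockCount = 1;  // keep one block resident even when empty
poolCreateInfo.maxBlockCount = 0;  // 0 = no limit

VmaPool pool = VK_NULL_HANDLE;
vmaCreatePool(allocator, &poolCreateInfo, &pool);

// Allocations are routed to the pool via VmaAllocationCreateInfo::pool.
VmaStatistics poolStats = {};
vmaGetPoolStatistics(allocator, pool, &poolStats);

vmaDestroyPool(allocator, pool); // free all of the pool's allocations first
```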
-*/ -VMA_CALL_PRE void VMA_CALL_POST vmaCalculatePoolStatistics( - VmaAllocator VMA_NOT_NULL allocator, - VmaPool VMA_NOT_NULL pool, - VmaDetailedStatistics* VMA_NOT_NULL pPoolStats); - -/** @} */ - -/** -\addtogroup group_alloc -@{ -*/ - -/** \brief Checks magic number in margins around all allocations in given memory pool in search for corruptions. - -Corruption detection is enabled only when `VMA_DEBUG_DETECT_CORRUPTION` macro is defined to nonzero, -`VMA_DEBUG_MARGIN` is defined to nonzero and the pool is created in memory type that is -`HOST_VISIBLE` and `HOST_COHERENT`. For more information, see [Corruption detection](@ref debugging_memory_usage_corruption_detection). - -Possible return values: - -- `VK_ERROR_FEATURE_NOT_PRESENT` - corruption detection is not enabled for specified pool. -- `VK_SUCCESS` - corruption detection has been performed and succeeded. -- `VK_ERROR_UNKNOWN` - corruption detection has been performed and found memory corruptions around one of the allocations. - `VMA_ASSERT` is also fired in that case. -- Other value: Error returned by Vulkan, e.g. memory mapping failure. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCheckPoolCorruption( - VmaAllocator VMA_NOT_NULL allocator, - VmaPool VMA_NOT_NULL pool); - -/** \brief Retrieves name of a custom pool. - -After the call `ppName` is either null or points to an internally-owned null-terminated string -containing name of the pool that was previously set. The pointer becomes invalid when the pool is -destroyed or its name is changed using vmaSetPoolName(). -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetPoolName( - VmaAllocator VMA_NOT_NULL allocator, - VmaPool VMA_NOT_NULL pool, - const char* VMA_NULLABLE* VMA_NOT_NULL ppName); - -/** \brief Sets name of a custom pool. - -`pName` can be either null or pointer to a null-terminated string with new name for the pool. -Function makes internal copy of the string, so it can be changed or freed immediately after this call. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaSetPoolName( - VmaAllocator VMA_NOT_NULL allocator, - VmaPool VMA_NOT_NULL pool, - const char* VMA_NULLABLE pName); - -/** \brief General purpose memory allocation. - -\param allocator -\param pVkMemoryRequirements -\param pCreateInfo -\param[out] pAllocation Handle to allocated memory. -\param[out] pAllocationInfo Optional. Information about allocated memory. It can be later fetched using function vmaGetAllocationInfo(). - -You should free the memory using vmaFreeMemory() or vmaFreeMemoryPages(). - -It is recommended to use vmaAllocateMemoryForBuffer(), vmaAllocateMemoryForImage(), -vmaCreateBuffer(), vmaCreateImage() instead whenever possible. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemory( - VmaAllocator VMA_NOT_NULL allocator, - const VkMemoryRequirements* VMA_NOT_NULL pVkMemoryRequirements, - const VmaAllocationCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaAllocation VMA_NULLABLE* VMA_NOT_NULL pAllocation, - VmaAllocationInfo* VMA_NULLABLE pAllocationInfo); - -/** \brief General purpose memory allocation for multiple allocation objects at once. - -\param allocator Allocator object. -\param pVkMemoryRequirements Memory requirements for each allocation. -\param pCreateInfo Creation parameters for each allocation. -\param allocationCount Number of allocations to make. -\param[out] pAllocations Pointer to array that will be filled with handles to created allocations. -\param[out] pAllocationInfo Optional. Pointer to array that will be filled with parameters of created allocations. 
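Pool naming and corruption checking are small conveniences, but a short sketch may still help. It assumes a valid `allocator`, an existing custom `pool`, and a build where `VMA_DEBUG_DETECT_CORRUPTION` and `VMA_DEBUG_MARGIN` are defined to nonzero.

```
// Sketch: label a pool for debugging tools and JSON dumps, then scan it for corruption.
vmaSetPoolName(allocator, pool, "TexturePool");

const char* name = nullptr;
vmaGetPoolName(allocator, pool, &name); // valid until renamed or the pool is destroyed

VkResult res = vmaCheckPoolCorruption(allocator, pool);
if (res == VK_ERROR_UNKNOWN)
{
    // A margin around some allocation in this pool was overwritten.
}
else if (res == VK_ERROR_FEATURE_NOT_PRESENT)
{
    // Corruption detection is not enabled for this pool's memory type.
}
```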
- -You should free the memory using vmaFreeMemory() or vmaFreeMemoryPages(). - -Word "pages" is just a suggestion to use this function to allocate pieces of memory needed for sparse binding. -It is just a general purpose allocation function able to make multiple allocations at once. -It may be internally optimized to be more efficient than calling vmaAllocateMemory() `allocationCount` times. - -All allocations are made using same parameters. All of them are created out of the same memory pool and type. -If any allocation fails, all allocations already made within this function call are also freed, so that when -returned result is not `VK_SUCCESS`, `pAllocation` array is always entirely filled with `VK_NULL_HANDLE`. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemoryPages( - VmaAllocator VMA_NOT_NULL allocator, - const VkMemoryRequirements* VMA_NOT_NULL VMA_LEN_IF_NOT_NULL(allocationCount) pVkMemoryRequirements, - const VmaAllocationCreateInfo* VMA_NOT_NULL VMA_LEN_IF_NOT_NULL(allocationCount) pCreateInfo, - size_t allocationCount, - VmaAllocation VMA_NULLABLE* VMA_NOT_NULL VMA_LEN_IF_NOT_NULL(allocationCount) pAllocations, - VmaAllocationInfo* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(allocationCount) pAllocationInfo); - -/** \brief Allocates memory suitable for given `VkBuffer`. - -\param allocator -\param buffer -\param pCreateInfo -\param[out] pAllocation Handle to allocated memory. -\param[out] pAllocationInfo Optional. Information about allocated memory. It can be later fetched using function vmaGetAllocationInfo(). - -It only creates #VmaAllocation. To bind the memory to the buffer, use vmaBindBufferMemory(). - -This is a special-purpose function. In most cases you should use vmaCreateBuffer(). - -You must free the allocation using vmaFreeMemory() when no longer needed. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemoryForBuffer( - VmaAllocator VMA_NOT_NULL allocator, - VkBuffer VMA_NOT_NULL_NON_DISPATCHABLE buffer, - const VmaAllocationCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaAllocation VMA_NULLABLE* VMA_NOT_NULL pAllocation, - VmaAllocationInfo* VMA_NULLABLE pAllocationInfo); - -/** \brief Allocates memory suitable for given `VkImage`. - -\param allocator -\param image -\param pCreateInfo -\param[out] pAllocation Handle to allocated memory. -\param[out] pAllocationInfo Optional. Information about allocated memory. It can be later fetched using function vmaGetAllocationInfo(). - -It only creates #VmaAllocation. To bind the memory to the buffer, use vmaBindImageMemory(). - -This is a special-purpose function. In most cases you should use vmaCreateImage(). - -You must free the allocation using vmaFreeMemory() when no longer needed. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemoryForImage( - VmaAllocator VMA_NOT_NULL allocator, - VkImage VMA_NOT_NULL_NON_DISPATCHABLE image, - const VmaAllocationCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaAllocation VMA_NULLABLE* VMA_NOT_NULL pAllocation, - VmaAllocationInfo* VMA_NULLABLE pAllocationInfo); - -/** \brief Frees memory previously allocated using vmaAllocateMemory(), vmaAllocateMemoryForBuffer(), or vmaAllocateMemoryForImage(). - -Passing `VK_NULL_HANDLE` as `allocation` is valid. Such function call is just skipped. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaFreeMemory( - VmaAllocator VMA_NOT_NULL allocator, - const VmaAllocation VMA_NULLABLE allocation); - -/** \brief Frees memory and destroys multiple allocations. - -Word "pages" is just a suggestion to use this function to free pieces of memory used for sparse binding. 
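The separate allocate-then-bind path described above (as opposed to the one-call vmaCreateBuffer()) looks roughly like the sketch below. It assumes a valid `allocator`, a `VkBuffer buffer` created with plain `vkCreateBuffer()`, the `device` used to create the allocator, and vmaBindBufferMemory(), which is declared further down in this header.

```
// Sketch: allocate memory for an existing VkBuffer and bind it.
VmaAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;

VmaAllocation allocation = VK_NULL_HANDLE;
VmaAllocationInfo allocInfo = {};
VkResult res = vmaAllocateMemoryForBuffer(allocator, buffer, &allocCreateInfo,
                                          &allocation, &allocInfo);
if (res == VK_SUCCESS)
{
    // Prefer this over vkBindBufferMemory(): VMA serializes binds/maps on the shared block.
    res = vmaBindBufferMemory(allocator, allocation, buffer);
}

// Later, when the buffer is no longer needed:
vkDestroyBuffer(device, buffer, nullptr);
vmaFreeMemory(allocator, allocation);
```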
-It is just a general purpose function to free memory and destroy allocations made using e.g. vmaAllocateMemory(), -vmaAllocateMemoryPages() and other functions. -It may be internally optimized to be more efficient than calling vmaFreeMemory() `allocationCount` times. - -Allocations in `pAllocations` array can come from any memory pools and types. -Passing `VK_NULL_HANDLE` as elements of `pAllocations` array is valid. Such entries are just skipped. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaFreeMemoryPages( - VmaAllocator VMA_NOT_NULL allocator, - size_t allocationCount, - const VmaAllocation VMA_NULLABLE* VMA_NOT_NULL VMA_LEN_IF_NOT_NULL(allocationCount) pAllocations); - -/** \brief Returns current information about specified allocation. - -Current paramteres of given allocation are returned in `pAllocationInfo`. - -Although this function doesn't lock any mutex, so it should be quite efficient, -you should avoid calling it too often. -You can retrieve same VmaAllocationInfo structure while creating your resource, from function -vmaCreateBuffer(), vmaCreateImage(). You can remember it if you are sure parameters don't change -(e.g. due to defragmentation). -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetAllocationInfo( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VmaAllocationInfo* VMA_NOT_NULL pAllocationInfo); - -/** \brief Sets pUserData in given allocation to new value. - -The value of pointer `pUserData` is copied to allocation's `pUserData`. -It is opaque, so you can use it however you want - e.g. -as a pointer, ordinal number or some handle to you own data. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaSetAllocationUserData( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - void* VMA_NULLABLE pUserData); - -/** \brief Sets pName in given allocation to new value. - -`pName` must be either null, or pointer to a null-terminated string. The function -makes local copy of the string and sets it as allocation's `pName`. String -passed as pName doesn't need to be valid for whole lifetime of the allocation - -you can free it after this call. String previously pointed by allocation's -`pName` is freed from memory. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaSetAllocationName( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - const char* VMA_NULLABLE pName); - -/** -\brief Given an allocation, returns Property Flags of its memory type. - -This is just a convenience function. Same information can be obtained using -vmaGetAllocationInfo() + vmaGetMemoryProperties(). -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetAllocationMemoryProperties( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VkMemoryPropertyFlags* VMA_NOT_NULL pFlags); - -/** \brief Maps memory represented by given allocation and returns pointer to it. - -Maps memory represented by given allocation to make it accessible to CPU code. -When succeeded, `*ppData` contains pointer to first byte of this memory. - -\warning -If the allocation is part of a bigger `VkDeviceMemory` block, returned pointer is -correctly offsetted to the beginning of region assigned to this particular allocation. -Unlike the result of `vkMapMemory`, it points to the allocation, not to the beginning of the whole block. -You should not add VmaAllocationInfo::offset to it! 
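A short sketch of attaching per-allocation metadata with the two setters above; a valid `allocator` and an existing `allocation` are assumed, and `MyResourceRecord` is a hypothetical user-defined type.

```
// Sketch: associate app-side bookkeeping and a debug name with an allocation.
MyResourceRecord* record = new MyResourceRecord{};
vmaSetAllocationUserData(allocator, allocation, record);
vmaSetAllocationName(allocator, allocation, "SceneVertexBuffer"); // string is copied internally

VmaAllocationInfo info = {};
vmaGetAllocationInfo(allocator, allocation, &info);
auto* fetched = static_cast<MyResourceRecord*>(info.pUserData);
// info.pName now points to the internally owned copy of "SceneVertexBuffer".
```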
- -Mapping is internally reference-counted and synchronized, so despite raw Vulkan -function `vkMapMemory()` cannot be used to map same block of `VkDeviceMemory` -multiple times simultaneously, it is safe to call this function on allocations -assigned to the same memory block. Actual Vulkan memory will be mapped on first -mapping and unmapped on last unmapping. - -If the function succeeded, you must call vmaUnmapMemory() to unmap the -allocation when mapping is no longer needed or before freeing the allocation, at -the latest. - -It also safe to call this function multiple times on the same allocation. You -must call vmaUnmapMemory() same number of times as you called vmaMapMemory(). - -It is also safe to call this function on allocation created with -#VMA_ALLOCATION_CREATE_MAPPED_BIT flag. Its memory stays mapped all the time. -You must still call vmaUnmapMemory() same number of times as you called -vmaMapMemory(). You must not call vmaUnmapMemory() additional time to free the -"0-th" mapping made automatically due to #VMA_ALLOCATION_CREATE_MAPPED_BIT flag. - -This function fails when used on allocation made in memory type that is not -`HOST_VISIBLE`. - -This function doesn't automatically flush or invalidate caches. -If the allocation is made from a memory types that is not `HOST_COHERENT`, -you also need to use vmaInvalidateAllocation() / vmaFlushAllocation(), as required by Vulkan specification. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaMapMemory( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - void* VMA_NULLABLE* VMA_NOT_NULL ppData); - -/** \brief Unmaps memory represented by given allocation, mapped previously using vmaMapMemory(). - -For details, see description of vmaMapMemory(). - -This function doesn't automatically flush or invalidate caches. -If the allocation is made from a memory types that is not `HOST_COHERENT`, -you also need to use vmaInvalidateAllocation() / vmaFlushAllocation(), as required by Vulkan specification. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaUnmapMemory( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation); - -/** \brief Flushes memory of given allocation. - -Calls `vkFlushMappedMemoryRanges()` for memory associated with given range of given allocation. -It needs to be called after writing to a mapped memory for memory types that are not `HOST_COHERENT`. -Unmap operation doesn't do that automatically. - -- `offset` must be relative to the beginning of allocation. -- `size` can be `VK_WHOLE_SIZE`. It means all memory from `offset` the the end of given allocation. -- `offset` and `size` don't have to be aligned. - They are internally rounded down/up to multiply of `nonCoherentAtomSize`. -- If `size` is 0, this call is ignored. -- If memory type that the `allocation` belongs to is not `HOST_VISIBLE` or it is `HOST_COHERENT`, - this call is ignored. - -Warning! `offset` and `size` are relative to the contents of given `allocation`. -If you mean whole allocation, you can pass 0 and `VK_WHOLE_SIZE`, respectively. -Do not pass allocation's offset as `offset`!!! - -This function returns the `VkResult` from `vkFlushMappedMemoryRanges` if it is -called, otherwise `VK_SUCCESS`. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFlushAllocation( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VkDeviceSize offset, - VkDeviceSize size); - -/** \brief Invalidates memory of given allocation. 
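Putting vmaMapMemory(), vmaFlushAllocation(), and vmaUnmapMemory() together, a typical CPU-write sequence looks like the sketch below. It assumes a valid `allocator`, a host-visible `allocation`, source data `src`/`srcSize`, and `<cstring>`; the flush is skipped when the memory type is `HOST_COHERENT`.

```
// Sketch: upload data through a temporary mapping.
void* mapped = nullptr;
if (vmaMapMemory(allocator, allocation, &mapped) == VK_SUCCESS)
{
    std::memcpy(mapped, src, srcSize);

    VkMemoryPropertyFlags memFlags = 0;
    vmaGetAllocationMemoryProperties(allocator, allocation, &memFlags);
    if ((memFlags & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) == 0)
    {
        // Offset/size are relative to the allocation; VK_WHOLE_SIZE flushes to its end.
        vmaFlushAllocation(allocator, allocation, 0, VK_WHOLE_SIZE);
    }
    vmaUnmapMemory(allocator, allocation);
}
```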
-
-Calls `vkInvalidateMappedMemoryRanges()` for memory associated with given range of given allocation.
-It needs to be called before reading from a mapped memory for memory types that are not `HOST_COHERENT`.
-Map operation doesn't do that automatically.
-
-- `offset` must be relative to the beginning of allocation.
-- `size` can be `VK_WHOLE_SIZE`. It means all memory from `offset` to the end of given allocation.
-- `offset` and `size` don't have to be aligned.
-  They are internally rounded down/up to a multiple of `nonCoherentAtomSize`.
-- If `size` is 0, this call is ignored.
-- If memory type that the `allocation` belongs to is not `HOST_VISIBLE` or it is `HOST_COHERENT`,
-  this call is ignored.
-
-Warning! `offset` and `size` are relative to the contents of given `allocation`.
-If you mean whole allocation, you can pass 0 and `VK_WHOLE_SIZE`, respectively.
-Do not pass allocation's offset as `offset`!!!
-
-This function returns the `VkResult` from `vkInvalidateMappedMemoryRanges` if
-it is called, otherwise `VK_SUCCESS`.
-*/
-VMA_CALL_PRE VkResult VMA_CALL_POST vmaInvalidateAllocation(
-    VmaAllocator VMA_NOT_NULL allocator,
-    VmaAllocation VMA_NOT_NULL allocation,
-    VkDeviceSize offset,
-    VkDeviceSize size);
-
-/** \brief Flushes memory of given set of allocations.
-
-Calls `vkFlushMappedMemoryRanges()` for memory associated with given ranges of given allocations.
-For more information, see documentation of vmaFlushAllocation().
-
-\param allocator
-\param allocationCount
-\param allocations
-\param offsets If not null, it must point to an array of offsets of regions to flush, relative to the beginning of respective allocations. Null means all offsets are zero.
-\param sizes If not null, it must point to an array of sizes of regions to flush in respective allocations. Null means `VK_WHOLE_SIZE` for all allocations.
-
-This function returns the `VkResult` from `vkFlushMappedMemoryRanges` if it is
-called, otherwise `VK_SUCCESS`.
-*/
-VMA_CALL_PRE VkResult VMA_CALL_POST vmaFlushAllocations(
-    VmaAllocator VMA_NOT_NULL allocator,
-    uint32_t allocationCount,
-    const VmaAllocation VMA_NOT_NULL* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(allocationCount) allocations,
-    const VkDeviceSize* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(allocationCount) offsets,
-    const VkDeviceSize* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(allocationCount) sizes);
-
-/** \brief Invalidates memory of given set of allocations.
-
-Calls `vkInvalidateMappedMemoryRanges()` for memory associated with given ranges of given allocations.
-For more information, see documentation of vmaInvalidateAllocation().
-
-\param allocator
-\param allocationCount
-\param allocations
-\param offsets If not null, it must point to an array of offsets of regions to invalidate, relative to the beginning of respective allocations. Null means all offsets are zero.
-\param sizes If not null, it must point to an array of sizes of regions to invalidate in respective allocations. Null means `VK_WHOLE_SIZE` for all allocations.
-
-This function returns the `VkResult` from `vkInvalidateMappedMemoryRanges` if it is
-called, otherwise `VK_SUCCESS`.
-*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaInvalidateAllocations( - VmaAllocator VMA_NOT_NULL allocator, - uint32_t allocationCount, - const VmaAllocation VMA_NOT_NULL* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(allocationCount) allocations, - const VkDeviceSize* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(allocationCount) offsets, - const VkDeviceSize* VMA_NULLABLE VMA_LEN_IF_NOT_NULL(allocationCount) sizes); - -/** \brief Checks magic number in margins around all allocations in given memory types (in both default and custom pools) in search for corruptions. - -\param allocator -\param memoryTypeBits Bit mask, where each bit set means that a memory type with that index should be checked. - -Corruption detection is enabled only when `VMA_DEBUG_DETECT_CORRUPTION` macro is defined to nonzero, -`VMA_DEBUG_MARGIN` is defined to nonzero and only for memory types that are -`HOST_VISIBLE` and `HOST_COHERENT`. For more information, see [Corruption detection](@ref debugging_memory_usage_corruption_detection). - -Possible return values: - -- `VK_ERROR_FEATURE_NOT_PRESENT` - corruption detection is not enabled for any of specified memory types. -- `VK_SUCCESS` - corruption detection has been performed and succeeded. -- `VK_ERROR_UNKNOWN` - corruption detection has been performed and found memory corruptions around one of the allocations. - `VMA_ASSERT` is also fired in that case. -- Other value: Error returned by Vulkan, e.g. memory mapping failure. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCheckCorruption( - VmaAllocator VMA_NOT_NULL allocator, - uint32_t memoryTypeBits); - -/** \brief Begins defragmentation process. - -\param allocator Allocator object. -\param pInfo Structure filled with parameters of defragmentation. -\param[out] pContext Context object that must be passed to vmaEndDefragmentation() to finish defragmentation. -\returns -- `VK_SUCCESS` if defragmentation can begin. -- `VK_ERROR_FEATURE_NOT_PRESENT` if defragmentation is not supported. - -For more information about defragmentation, see documentation chapter: -[Defragmentation](@ref defragmentation). -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBeginDefragmentation( - VmaAllocator VMA_NOT_NULL allocator, - const VmaDefragmentationInfo* VMA_NOT_NULL pInfo, - VmaDefragmentationContext VMA_NULLABLE* VMA_NOT_NULL pContext); - -/** \brief Ends defragmentation process. - -\param allocator Allocator object. -\param context Context object that has been created by vmaBeginDefragmentation(). -\param[out] pStats Optional stats for the defragmentation. Can be null. - -Use this function to finish defragmentation started by vmaBeginDefragmentation(). -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaEndDefragmentation( - VmaAllocator VMA_NOT_NULL allocator, - VmaDefragmentationContext VMA_NOT_NULL context, - VmaDefragmentationStats* VMA_NULLABLE pStats); - -/** \brief Starts single defragmentation pass. - -\param allocator Allocator object. -\param context Context object that has been created by vmaBeginDefragmentation(). -\param[out] pPassInfo Computed informations for current pass. -\returns -- `VK_SUCCESS` if no more moves are possible. Then you can omit call to vmaEndDefragmentationPass() and simply end whole defragmentation. -- `VK_INCOMPLETE` if there are pending moves returned in `pPassInfo`. You need to perform them, call vmaEndDefragmentationPass(), - and then preferably try another pass with vmaBeginDefragmentationPass(). 
-*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBeginDefragmentationPass( - VmaAllocator VMA_NOT_NULL allocator, - VmaDefragmentationContext VMA_NOT_NULL context, - VmaDefragmentationPassMoveInfo* VMA_NOT_NULL pPassInfo); - -/** \brief Ends single defragmentation pass. - -\param allocator Allocator object. -\param context Context object that has been created by vmaBeginDefragmentation(). -\param pPassInfo Computed informations for current pass filled by vmaBeginDefragmentationPass() and possibly modified by you. - -Returns `VK_SUCCESS` if no more moves are possible or `VK_INCOMPLETE` if more defragmentations are possible. - -Ends incremental defragmentation pass and commits all defragmentation moves from `pPassInfo`. -After this call: - -- Allocations at `pPassInfo[i].srcAllocation` that had `pPassInfo[i].operation ==` #VMA_DEFRAGMENTATION_MOVE_OPERATION_COPY - (which is the default) will be pointing to the new destination place. -- Allocation at `pPassInfo[i].srcAllocation` that had `pPassInfo[i].operation ==` #VMA_DEFRAGMENTATION_MOVE_OPERATION_DESTROY - will be freed. - -If no more moves are possible you can end whole defragmentation. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaEndDefragmentationPass( - VmaAllocator VMA_NOT_NULL allocator, - VmaDefragmentationContext VMA_NOT_NULL context, - VmaDefragmentationPassMoveInfo* VMA_NOT_NULL pPassInfo); - -/** \brief Binds buffer to allocation. - -Binds specified buffer to region of memory represented by specified allocation. -Gets `VkDeviceMemory` handle and offset from the allocation. -If you want to create a buffer, allocate memory for it and bind them together separately, -you should use this function for binding instead of standard `vkBindBufferMemory()`, -because it ensures proper synchronization so that when a `VkDeviceMemory` object is used by multiple -allocations, calls to `vkBind*Memory()` or `vkMapMemory()` won't happen from multiple threads simultaneously -(which is illegal in Vulkan). - -It is recommended to use function vmaCreateBuffer() instead of this one. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindBufferMemory( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VkBuffer VMA_NOT_NULL_NON_DISPATCHABLE buffer); - -/** \brief Binds buffer to allocation with additional parameters. - -\param allocator -\param allocation -\param allocationLocalOffset Additional offset to be added while binding, relative to the beginning of the `allocation`. Normally it should be 0. -\param buffer -\param pNext A chain of structures to be attached to `VkBindBufferMemoryInfoKHR` structure used internally. Normally it should be null. - -This function is similar to vmaBindBufferMemory(), but it provides additional parameters. - -If `pNext` is not null, #VmaAllocator object must have been created with #VMA_ALLOCATOR_CREATE_KHR_BIND_MEMORY2_BIT flag -or with VmaAllocatorCreateInfo::vulkanApiVersion `>= VK_API_VERSION_1_1`. Otherwise the call fails. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindBufferMemory2( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VkDeviceSize allocationLocalOffset, - VkBuffer VMA_NOT_NULL_NON_DISPATCHABLE buffer, - const void* VMA_NULLABLE pNext); - -/** \brief Binds image to allocation. - -Binds specified image to region of memory represented by specified allocation. -Gets `VkDeviceMemory` handle and offset from the allocation. 
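The defragmentation entry points above combine into the pass loop sketched below. It is only a sketch: the resource re-creation and GPU copy in the middle are application-specific and indicated by comments, error handling is abbreviated, and a valid `allocator` is assumed.

```
// Sketch: defragment the default pools.
VmaDefragmentationInfo defragInfo = {};
VmaDefragmentationContext defragCtx = VK_NULL_HANDLE;
VkResult res = vmaBeginDefragmentation(allocator, &defragInfo, &defragCtx);
if (res == VK_SUCCESS)
{
    for (;;)
    {
        VmaDefragmentationPassMoveInfo pass = {};
        res = vmaBeginDefragmentationPass(allocator, defragCtx, &pass);
        if (res != VK_INCOMPLETE) // VK_SUCCESS: nothing left to move; otherwise an error
            break;

        for (uint32_t i = 0; i < pass.moveCount; ++i)
        {
            // Create a replacement buffer/image, bind it to pass.pMoves[i].dstTmpAllocation
            // with vmaBindBufferMemory()/vmaBindImageMemory(), record a GPU copy from the
            // old resource, or set pass.pMoves[i].operation to
            // VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE to skip this move.
        }
        // Submit the copies and wait for them to finish before ending the pass.

        res = vmaEndDefragmentationPass(allocator, defragCtx, &pass);
        if (res != VK_INCOMPLETE)
            break;
    }

    VmaDefragmentationStats stats = {};
    vmaEndDefragmentation(allocator, defragCtx, &stats);
}
```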
-If you want to create an image, allocate memory for it and bind them together separately, -you should use this function for binding instead of standard `vkBindImageMemory()`, -because it ensures proper synchronization so that when a `VkDeviceMemory` object is used by multiple -allocations, calls to `vkBind*Memory()` or `vkMapMemory()` won't happen from multiple threads simultaneously -(which is illegal in Vulkan). - -It is recommended to use function vmaCreateImage() instead of this one. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindImageMemory( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VkImage VMA_NOT_NULL_NON_DISPATCHABLE image); - -/** \brief Binds image to allocation with additional parameters. - -\param allocator -\param allocation -\param allocationLocalOffset Additional offset to be added while binding, relative to the beginning of the `allocation`. Normally it should be 0. -\param image -\param pNext A chain of structures to be attached to `VkBindImageMemoryInfoKHR` structure used internally. Normally it should be null. - -This function is similar to vmaBindImageMemory(), but it provides additional parameters. - -If `pNext` is not null, #VmaAllocator object must have been created with #VMA_ALLOCATOR_CREATE_KHR_BIND_MEMORY2_BIT flag -or with VmaAllocatorCreateInfo::vulkanApiVersion `>= VK_API_VERSION_1_1`. Otherwise the call fails. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindImageMemory2( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VkDeviceSize allocationLocalOffset, - VkImage VMA_NOT_NULL_NON_DISPATCHABLE image, - const void* VMA_NULLABLE pNext); - -/** \brief Creates a new `VkBuffer`, allocates and binds memory for it. - -\param allocator -\param pBufferCreateInfo -\param pAllocationCreateInfo -\param[out] pBuffer Buffer that was created. -\param[out] pAllocation Allocation that was created. -\param[out] pAllocationInfo Optional. Information about allocated memory. It can be later fetched using function vmaGetAllocationInfo(). - -This function automatically: - --# Creates buffer. --# Allocates appropriate memory for it. --# Binds the buffer with the memory. - -If any of these operations fail, buffer and allocation are not created, -returned value is negative error code, `*pBuffer` and `*pAllocation` are null. - -If the function succeeded, you must destroy both buffer and allocation when you -no longer need them using either convenience function vmaDestroyBuffer() or -separately, using `vkDestroyBuffer()` and vmaFreeMemory(). - -If #VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT flag was used, -VK_KHR_dedicated_allocation extension is used internally to query driver whether -it requires or prefers the new buffer to have dedicated allocation. If yes, -and if dedicated allocation is possible -(#VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT is not used), it creates dedicated -allocation for this buffer, just like when using -#VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT. - -\note This function creates a new `VkBuffer`. Sub-allocation of parts of one large buffer, -although recommended as a good practice, is out of scope of this library and could be implemented -by the user as a higher-level logic on top of VMA. 
-*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateBuffer( - VmaAllocator VMA_NOT_NULL allocator, - const VkBufferCreateInfo* VMA_NOT_NULL pBufferCreateInfo, - const VmaAllocationCreateInfo* VMA_NOT_NULL pAllocationCreateInfo, - VkBuffer VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pBuffer, - VmaAllocation VMA_NULLABLE* VMA_NOT_NULL pAllocation, - VmaAllocationInfo* VMA_NULLABLE pAllocationInfo); - -/** \brief Creates a buffer with additional minimum alignment. - -Similar to vmaCreateBuffer() but provides additional parameter `minAlignment` which allows to specify custom, -minimum alignment to be used when placing the buffer inside a larger memory block, which may be needed e.g. -for interop with OpenGL. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateBufferWithAlignment( - VmaAllocator VMA_NOT_NULL allocator, - const VkBufferCreateInfo* VMA_NOT_NULL pBufferCreateInfo, - const VmaAllocationCreateInfo* VMA_NOT_NULL pAllocationCreateInfo, - VkDeviceSize minAlignment, - VkBuffer VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pBuffer, - VmaAllocation VMA_NULLABLE* VMA_NOT_NULL pAllocation, - VmaAllocationInfo* VMA_NULLABLE pAllocationInfo); - -/** \brief Creates a new `VkBuffer`, binds already created memory for it. - -\param allocator -\param allocation Allocation that provides memory to be used for binding new buffer to it. -\param pBufferCreateInfo -\param[out] pBuffer Buffer that was created. - -This function automatically: - --# Creates buffer. --# Binds the buffer with the supplied memory. - -If any of these operations fail, buffer is not created, -returned value is negative error code and `*pBuffer` is null. - -If the function succeeded, you must destroy the buffer when you -no longer need it using `vkDestroyBuffer()`. If you want to also destroy the corresponding -allocation you can use convenience function vmaDestroyBuffer(). -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateAliasingBuffer( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - const VkBufferCreateInfo* VMA_NOT_NULL pBufferCreateInfo, - VkBuffer VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pBuffer); - -/** \brief Destroys Vulkan buffer and frees allocated memory. - -This is just a convenience function equivalent to: - -\code -vkDestroyBuffer(device, buffer, allocationCallbacks); -vmaFreeMemory(allocator, allocation); -\endcode - -It it safe to pass null as buffer and/or allocation. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyBuffer( - VmaAllocator VMA_NOT_NULL allocator, - VkBuffer VMA_NULLABLE_NON_DISPATCHABLE buffer, - VmaAllocation VMA_NULLABLE allocation); - -/// Function similar to vmaCreateBuffer(). -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateImage( - VmaAllocator VMA_NOT_NULL allocator, - const VkImageCreateInfo* VMA_NOT_NULL pImageCreateInfo, - const VmaAllocationCreateInfo* VMA_NOT_NULL pAllocationCreateInfo, - VkImage VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pImage, - VmaAllocation VMA_NULLABLE* VMA_NOT_NULL pAllocation, - VmaAllocationInfo* VMA_NULLABLE pAllocationInfo); - -/// Function similar to vmaCreateAliasingBuffer(). -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateAliasingImage( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - const VkImageCreateInfo* VMA_NOT_NULL pImageCreateInfo, - VkImage VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pImage); - -/** \brief Destroys Vulkan image and frees allocated memory. 
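The extra `minAlignment` parameter of vmaCreateBufferWithAlignment() is easiest to see in a sketch; here a buffer intended for external-API interop is placed on a 4 KiB boundary. A valid `allocator` and a filled `bufCreateInfo` (hypothetical name) are assumed.

```
// Sketch: force a stricter placement alignment than vkGetBufferMemoryRequirements() reports.
VmaAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;

VkBuffer interopBuffer = VK_NULL_HANDLE;
VmaAllocation interopAlloc = VK_NULL_HANDLE;
VkResult res = vmaCreateBufferWithAlignment(
    allocator, &bufCreateInfo, &allocCreateInfo,
    /*minAlignment=*/4096,
    &interopBuffer, &interopAlloc, nullptr);

// Destroy both at once when done; passing null handles here would also be safe.
vmaDestroyBuffer(allocator, interopBuffer, interopAlloc);
```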
- -This is just a convenience function equivalent to: - -\code -vkDestroyImage(device, image, allocationCallbacks); -vmaFreeMemory(allocator, allocation); -\endcode - -It it safe to pass null as image and/or allocation. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyImage( - VmaAllocator VMA_NOT_NULL allocator, - VkImage VMA_NULLABLE_NON_DISPATCHABLE image, - VmaAllocation VMA_NULLABLE allocation); - -/** @} */ - -/** -\addtogroup group_virtual -@{ -*/ - -/** \brief Creates new #VmaVirtualBlock object. - -\param pCreateInfo Parameters for creation. -\param[out] pVirtualBlock Returned virtual block object or `VMA_NULL` if creation failed. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateVirtualBlock( - const VmaVirtualBlockCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaVirtualBlock VMA_NULLABLE* VMA_NOT_NULL pVirtualBlock); - -/** \brief Destroys #VmaVirtualBlock object. - -Please note that you should consciously handle virtual allocations that could remain unfreed in the block. -You should either free them individually using vmaVirtualFree() or call vmaClearVirtualBlock() -if you are sure this is what you want. If you do neither, an assert is called. - -If you keep pointers to some additional metadata associated with your virtual allocations in their `pUserData`, -don't forget to free them. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyVirtualBlock( - VmaVirtualBlock VMA_NULLABLE virtualBlock); - -/** \brief Returns true of the #VmaVirtualBlock is empty - contains 0 virtual allocations and has all its space available for new allocations. -*/ -VMA_CALL_PRE VkBool32 VMA_CALL_POST vmaIsVirtualBlockEmpty( - VmaVirtualBlock VMA_NOT_NULL virtualBlock); - -/** \brief Returns information about a specific virtual allocation within a virtual block, like its size and `pUserData` pointer. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaGetVirtualAllocationInfo( - VmaVirtualBlock VMA_NOT_NULL virtualBlock, - VmaVirtualAllocation VMA_NOT_NULL_NON_DISPATCHABLE allocation, VmaVirtualAllocationInfo* VMA_NOT_NULL pVirtualAllocInfo); - -/** \brief Allocates new virtual allocation inside given #VmaVirtualBlock. - -If the allocation fails due to not enough free space available, `VK_ERROR_OUT_OF_DEVICE_MEMORY` is returned -(despite the function doesn't ever allocate actual GPU memory). -`pAllocation` is then set to `VK_NULL_HANDLE` and `pOffset`, if not null, it set to `UINT64_MAX`. - -\param virtualBlock Virtual block -\param pCreateInfo Parameters for the allocation -\param[out] pAllocation Returned handle of the new allocation -\param[out] pOffset Returned offset of the new allocation. Optional, can be null. -*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaVirtualAllocate( - VmaVirtualBlock VMA_NOT_NULL virtualBlock, - const VmaVirtualAllocationCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaVirtualAllocation VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pAllocation, - VkDeviceSize* VMA_NULLABLE pOffset); - -/** \brief Frees virtual allocation inside given #VmaVirtualBlock. - -It is correct to call this function with `allocation == VK_NULL_HANDLE` - it does nothing. -*/ -VMA_CALL_PRE void VMA_CALL_POST vmaVirtualFree( - VmaVirtualBlock VMA_NOT_NULL virtualBlock, - VmaVirtualAllocation VMA_NULLABLE_NON_DISPATCHABLE allocation); - -/** \brief Frees all virtual allocations inside given #VmaVirtualBlock. - -You must either call this function or free each virtual allocation individually with vmaVirtualFree() -before destroying a virtual block. Otherwise, an assert is called. 
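The virtual allocator is independent of Vulkan memory: it only computes offsets inside a region the caller manages. A minimal sketch, assuming nothing beyond this header, in which a 1 MiB region is sub-allocated and then freed:

```
// Sketch: CPU-side sub-allocation bookkeeping with a virtual block.
VmaVirtualBlockCreateInfo blockCreateInfo = {};
blockCreateInfo.size = 1048576; // 1 MiB, in whatever units the caller uses consistently

VmaVirtualBlock block = VK_NULL_HANDLE;
VkResult res = vmaCreateVirtualBlock(&blockCreateInfo, &block);

VmaVirtualAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.size = 4096;
allocCreateInfo.alignment = 256;

VmaVirtualAllocation alloc = VK_NULL_HANDLE;
VkDeviceSize offset = 0;
res = vmaVirtualAllocate(block, &allocCreateInfo, &alloc, &offset);
// On success, [offset, offset + 4096) is reserved inside the 1 MiB region.

vmaVirtualFree(block, alloc);  // or vmaClearVirtualBlock(block) to drop everything at once
vmaDestroyVirtualBlock(block); // asserts if allocations are still outstanding
```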
-/** \brief Frees all virtual allocations inside given #VmaVirtualBlock.
-
-You must either call this function or free each virtual allocation individually with vmaVirtualFree()
-before destroying a virtual block. Otherwise, an assert is called.
-
-If you keep a pointer to some additional metadata associated with your virtual allocation in its `pUserData`,
-don't forget to free it as well.
-*/
-VMA_CALL_PRE void VMA_CALL_POST vmaClearVirtualBlock(
-    VmaVirtualBlock VMA_NOT_NULL virtualBlock);
-
-/** \brief Changes custom pointer associated with given virtual allocation.
-*/
-VMA_CALL_PRE void VMA_CALL_POST vmaSetVirtualAllocationUserData(
-    VmaVirtualBlock VMA_NOT_NULL virtualBlock,
-    VmaVirtualAllocation VMA_NOT_NULL_NON_DISPATCHABLE allocation,
-    void* VMA_NULLABLE pUserData);
-
-/** \brief Calculates and returns statistics about virtual allocations and memory usage in given #VmaVirtualBlock.
-
-This function is fast to call. For more detailed statistics, see vmaCalculateVirtualBlockStatistics().
-*/
-VMA_CALL_PRE void VMA_CALL_POST vmaGetVirtualBlockStatistics(
-    VmaVirtualBlock VMA_NOT_NULL virtualBlock,
-    VmaStatistics* VMA_NOT_NULL pStats);
-
-/** \brief Calculates and returns detailed statistics about virtual allocations and memory usage in given #VmaVirtualBlock.
-
-This function is slow to call. Use it for debugging purposes.
-For less detailed statistics, see vmaGetVirtualBlockStatistics().
-*/
-VMA_CALL_PRE void VMA_CALL_POST vmaCalculateVirtualBlockStatistics(
-    VmaVirtualBlock VMA_NOT_NULL virtualBlock,
-    VmaDetailedStatistics* VMA_NOT_NULL pStats);
-
-/** @} */
-
-#if VMA_STATS_STRING_ENABLED
-/**
-\addtogroup group_stats
-@{
-*/
-
-/** \brief Builds and returns a null-terminated string in JSON format with information about given #VmaVirtualBlock.
-\param virtualBlock Virtual block.
-\param[out] ppStatsString Returned string.
-\param detailedMap Pass `VK_FALSE` to only obtain statistics as returned by vmaCalculateVirtualBlockStatistics(). Pass `VK_TRUE` to also obtain full list of allocations and free spaces.
-
-The returned string must be freed using vmaFreeVirtualBlockStatsString().
-*/
-VMA_CALL_PRE void VMA_CALL_POST vmaBuildVirtualBlockStatsString(
-    VmaVirtualBlock VMA_NOT_NULL virtualBlock,
-    char* VMA_NULLABLE* VMA_NOT_NULL ppStatsString,
-    VkBool32 detailedMap);
-
-/// Frees a string returned by vmaBuildVirtualBlockStatsString().
-VMA_CALL_PRE void VMA_CALL_POST vmaFreeVirtualBlockStatsString(
-    VmaVirtualBlock VMA_NOT_NULL virtualBlock,
-    char* VMA_NULLABLE pStatsString);
-
-/** \brief Builds and returns statistics as a null-terminated string in JSON format.
-\param allocator
-\param[out] ppStatsString Must be freed using vmaFreeStatsString() function.
-\param detailedMap
-*/
-VMA_CALL_PRE void VMA_CALL_POST vmaBuildStatsString(
-    VmaAllocator VMA_NOT_NULL allocator,
-    char* VMA_NULLABLE* VMA_NOT_NULL ppStatsString,
-    VkBool32 detailedMap);
-
-VMA_CALL_PRE void VMA_CALL_POST vmaFreeStatsString(
-    VmaAllocator VMA_NOT_NULL allocator,
-    char* VMA_NULLABLE pStatsString);
-
-/** @} */
-
-#endif // VMA_STATS_STRING_ENABLED
-
-#endif // _VMA_FUNCTION_HEADERS
-
-#ifdef __cplusplus
-}
-#endif
-
-#endif // AMD_VULKAN_MEMORY_ALLOCATOR_H
-
-////////////////////////////////////////////////////////////////////////////////
-////////////////////////////////////////////////////////////////////////////////
-//
-// IMPLEMENTATION
-//
-////////////////////////////////////////////////////////////////////////////////
-////////////////////////////////////////////////////////////////////////////////
-
-// For Visual Studio IntelliSense.
-#if defined(__cplusplus) && defined(__INTELLISENSE__) -#define VMA_IMPLEMENTATION -#endif - -#ifdef VMA_IMPLEMENTATION -#undef VMA_IMPLEMENTATION - -#include -#include -#include -#include -#include - -#ifdef _MSC_VER - #include // For functions like __popcnt, _BitScanForward etc. -#endif -#if __cplusplus >= 202002L || _MSVC_LANG >= 202002L // C++20 - #include // For std::popcount -#endif - -/******************************************************************************* -CONFIGURATION SECTION - -Define some of these macros before each #include of this header or change them -here if you need other then default behavior depending on your environment. -*/ -#ifndef _VMA_CONFIGURATION - -/* -Define this macro to 1 to make the library fetch pointers to Vulkan functions -internally, like: - - vulkanFunctions.vkAllocateMemory = &vkAllocateMemory; -*/ -#if !defined(VMA_STATIC_VULKAN_FUNCTIONS) && !defined(VK_NO_PROTOTYPES) - #define VMA_STATIC_VULKAN_FUNCTIONS 1 -#endif - -/* -Define this macro to 1 to make the library fetch pointers to Vulkan functions -internally, like: - - vulkanFunctions.vkAllocateMemory = (PFN_vkAllocateMemory)vkGetDeviceProcAddr(device, "vkAllocateMemory"); - -To use this feature in new versions of VMA you now have to pass -VmaVulkanFunctions::vkGetInstanceProcAddr and vkGetDeviceProcAddr as -VmaAllocatorCreateInfo::pVulkanFunctions. Other members can be null. -*/ -#if !defined(VMA_DYNAMIC_VULKAN_FUNCTIONS) - #define VMA_DYNAMIC_VULKAN_FUNCTIONS 1 -#endif - -#ifndef VMA_USE_STL_SHARED_MUTEX - // Compiler conforms to C++17. - #if __cplusplus >= 201703L - #define VMA_USE_STL_SHARED_MUTEX 1 - // Visual studio defines __cplusplus properly only when passed additional parameter: /Zc:__cplusplus - // Otherwise it is always 199711L, despite shared_mutex works since Visual Studio 2015 Update 2. - #elif defined(_MSC_FULL_VER) && _MSC_FULL_VER >= 190023918 && __cplusplus == 199711L && _MSVC_LANG >= 201703L - #define VMA_USE_STL_SHARED_MUTEX 1 - #else - #define VMA_USE_STL_SHARED_MUTEX 0 - #endif -#endif - -/* -Define this macro to include custom header files without having to edit this file directly, e.g.: - - // Inside of "my_vma_configuration_user_includes.h": - - #include "my_custom_assert.h" // for MY_CUSTOM_ASSERT - #include "my_custom_min.h" // for my_custom_min - #include - #include - - // Inside a different file, which includes "vk_mem_alloc.h": - - #define VMA_CONFIGURATION_USER_INCLUDES_H "my_vma_configuration_user_includes.h" - #define VMA_ASSERT(expr) MY_CUSTOM_ASSERT(expr) - #define VMA_MIN(v1, v2) (my_custom_min(v1, v2)) - #include "vk_mem_alloc.h" - ... - -The following headers are used in this CONFIGURATION section only, so feel free to -remove them if not needed. -*/ -#if !defined(VMA_CONFIGURATION_USER_INCLUDES_H) - #include // for assert - #include // for min, max - #include -#else - #include VMA_CONFIGURATION_USER_INCLUDES_H -#endif - -#ifndef VMA_NULL - // Value used as null pointer. Define it to e.g.: nullptr, NULL, 0, (void*)0. 
- #define VMA_NULL nullptr -#endif - -#if defined(__ANDROID_API__) && (__ANDROID_API__ < 16) -#include -static void* vma_aligned_alloc(size_t alignment, size_t size) -{ - // alignment must be >= sizeof(void*) - if(alignment < sizeof(void*)) - { - alignment = sizeof(void*); - } - - return memalign(alignment, size); -} -#elif defined(__APPLE__) || defined(__ANDROID__) || (defined(__linux__) && defined(__GLIBCXX__) && !defined(_GLIBCXX_HAVE_ALIGNED_ALLOC)) -#include - -#if defined(__APPLE__) -#include -#endif - -static void* vma_aligned_alloc(size_t alignment, size_t size) -{ - // Unfortunately, aligned_alloc causes VMA to crash due to it returning null pointers. (At least under 11.4) - // Therefore, for now disable this specific exception until a proper solution is found. - //#if defined(__APPLE__) && (defined(MAC_OS_X_VERSION_10_16) || defined(__IPHONE_14_0)) - //#if MAC_OS_X_VERSION_MAX_ALLOWED >= MAC_OS_X_VERSION_10_16 || __IPHONE_OS_VERSION_MAX_ALLOWED >= __IPHONE_14_0 - // // For C++14, usr/include/malloc/_malloc.h declares aligned_alloc()) only - // // with the MacOSX11.0 SDK in Xcode 12 (which is what adds - // // MAC_OS_X_VERSION_10_16), even though the function is marked - // // availabe for 10.15. That is why the preprocessor checks for 10.16 but - // // the __builtin_available checks for 10.15. - // // People who use C++17 could call aligned_alloc with the 10.15 SDK already. - // if (__builtin_available(macOS 10.15, iOS 13, *)) - // return aligned_alloc(alignment, size); - //#endif - //#endif - - // alignment must be >= sizeof(void*) - if(alignment < sizeof(void*)) - { - alignment = sizeof(void*); - } - - void *pointer; - if(posix_memalign(&pointer, alignment, size) == 0) - return pointer; - return VMA_NULL; -} -#elif defined(_WIN32) -static void* vma_aligned_alloc(size_t alignment, size_t size) -{ - return _aligned_malloc(size, alignment); -} -#else -static void* vma_aligned_alloc(size_t alignment, size_t size) -{ - return aligned_alloc(alignment, size); -} -#endif - -#if defined(_WIN32) -static void vma_aligned_free(void* ptr) -{ - _aligned_free(ptr); -} -#else -static void vma_aligned_free(void* VMA_NULLABLE ptr) -{ - free(ptr); -} -#endif - -// If your compiler is not compatible with C++11 and definition of -// aligned_alloc() function is missing, uncommeting following line may help: - -//#include - -// Normal assert to check for programmer's errors, especially in Debug configuration. -#ifndef VMA_ASSERT - #ifdef NDEBUG - #define VMA_ASSERT(expr) - #else - #define VMA_ASSERT(expr) assert(expr) - #endif -#endif - -// Assert that will be called very often, like inside data structures e.g. operator[]. -// Making it non-empty can make program slow. 
-#ifndef VMA_HEAVY_ASSERT - #ifdef NDEBUG - #define VMA_HEAVY_ASSERT(expr) - #else - #define VMA_HEAVY_ASSERT(expr) //VMA_ASSERT(expr) - #endif -#endif - -#ifndef VMA_ALIGN_OF - #define VMA_ALIGN_OF(type) (__alignof(type)) -#endif - -#ifndef VMA_SYSTEM_ALIGNED_MALLOC - #define VMA_SYSTEM_ALIGNED_MALLOC(size, alignment) vma_aligned_alloc((alignment), (size)) -#endif - -#ifndef VMA_SYSTEM_ALIGNED_FREE - // VMA_SYSTEM_FREE is the old name, but might have been defined by the user - #if defined(VMA_SYSTEM_FREE) - #define VMA_SYSTEM_ALIGNED_FREE(ptr) VMA_SYSTEM_FREE(ptr) - #else - #define VMA_SYSTEM_ALIGNED_FREE(ptr) vma_aligned_free(ptr) - #endif -#endif - -#ifndef VMA_COUNT_BITS_SET - // Returns number of bits set to 1 in (v) - #define VMA_COUNT_BITS_SET(v) VmaCountBitsSet(v) -#endif - -#ifndef VMA_BITSCAN_LSB - // Scans integer for index of first nonzero value from the Least Significant Bit (LSB). If mask is 0 then returns UINT8_MAX - #define VMA_BITSCAN_LSB(mask) VmaBitScanLSB(mask) -#endif - -#ifndef VMA_BITSCAN_MSB - // Scans integer for index of first nonzero value from the Most Significant Bit (MSB). If mask is 0 then returns UINT8_MAX - #define VMA_BITSCAN_MSB(mask) VmaBitScanMSB(mask) -#endif - -#ifndef VMA_MIN - #define VMA_MIN(v1, v2) ((std::min)((v1), (v2))) -#endif - -#ifndef VMA_MAX - #define VMA_MAX(v1, v2) ((std::max)((v1), (v2))) -#endif - -#ifndef VMA_SWAP - #define VMA_SWAP(v1, v2) std::swap((v1), (v2)) -#endif - -#ifndef VMA_SORT - #define VMA_SORT(beg, end, cmp) std::sort(beg, end, cmp) -#endif - -#ifndef VMA_DEBUG_LOG - #define VMA_DEBUG_LOG(format, ...) - /* - #define VMA_DEBUG_LOG(format, ...) do { \ - printf(format, __VA_ARGS__); \ - printf("\n"); \ - } while(false) - */ -#endif - -// Define this macro to 1 to enable functions: vmaBuildStatsString, vmaFreeStatsString. -#if VMA_STATS_STRING_ENABLED - static inline void VmaUint32ToStr(char* VMA_NOT_NULL outStr, size_t strLen, uint32_t num) - { - snprintf(outStr, strLen, "%u", static_cast(num)); - } - static inline void VmaUint64ToStr(char* VMA_NOT_NULL outStr, size_t strLen, uint64_t num) - { - snprintf(outStr, strLen, "%llu", static_cast(num)); - } - static inline void VmaPtrToStr(char* VMA_NOT_NULL outStr, size_t strLen, const void* ptr) - { - snprintf(outStr, strLen, "%p", ptr); - } -#endif - -#ifndef VMA_MUTEX - class VmaMutex - { - public: - void Lock() { m_Mutex.lock(); } - void Unlock() { m_Mutex.unlock(); } - bool TryLock() { return m_Mutex.try_lock(); } - private: - std::mutex m_Mutex; - }; - #define VMA_MUTEX VmaMutex -#endif - -// Read-write mutex, where "read" is shared access, "write" is exclusive access. -#ifndef VMA_RW_MUTEX - #if VMA_USE_STL_SHARED_MUTEX - // Use std::shared_mutex from C++17. - #include - class VmaRWMutex - { - public: - void LockRead() { m_Mutex.lock_shared(); } - void UnlockRead() { m_Mutex.unlock_shared(); } - bool TryLockRead() { return m_Mutex.try_lock_shared(); } - void LockWrite() { m_Mutex.lock(); } - void UnlockWrite() { m_Mutex.unlock(); } - bool TryLockWrite() { return m_Mutex.try_lock(); } - private: - std::shared_mutex m_Mutex; - }; - #define VMA_RW_MUTEX VmaRWMutex - #elif defined(_WIN32) && defined(WINVER) && WINVER >= 0x0600 - // Use SRWLOCK from WinAPI. - // Minimum supported client = Windows Vista, server = Windows Server 2008. 
- class VmaRWMutex - { - public: - VmaRWMutex() { InitializeSRWLock(&m_Lock); } - void LockRead() { AcquireSRWLockShared(&m_Lock); } - void UnlockRead() { ReleaseSRWLockShared(&m_Lock); } - bool TryLockRead() { return TryAcquireSRWLockShared(&m_Lock) != FALSE; } - void LockWrite() { AcquireSRWLockExclusive(&m_Lock); } - void UnlockWrite() { ReleaseSRWLockExclusive(&m_Lock); } - bool TryLockWrite() { return TryAcquireSRWLockExclusive(&m_Lock) != FALSE; } - private: - SRWLOCK m_Lock; - }; - #define VMA_RW_MUTEX VmaRWMutex - #else - // Less efficient fallback: Use normal mutex. - class VmaRWMutex - { - public: - void LockRead() { m_Mutex.Lock(); } - void UnlockRead() { m_Mutex.Unlock(); } - bool TryLockRead() { return m_Mutex.TryLock(); } - void LockWrite() { m_Mutex.Lock(); } - void UnlockWrite() { m_Mutex.Unlock(); } - bool TryLockWrite() { return m_Mutex.TryLock(); } - private: - VMA_MUTEX m_Mutex; - }; - #define VMA_RW_MUTEX VmaRWMutex - #endif // #if VMA_USE_STL_SHARED_MUTEX -#endif // #ifndef VMA_RW_MUTEX - -/* -If providing your own implementation, you need to implement a subset of std::atomic. -*/ -#ifndef VMA_ATOMIC_UINT32 - #include - #define VMA_ATOMIC_UINT32 std::atomic -#endif - -#ifndef VMA_ATOMIC_UINT64 - #include - #define VMA_ATOMIC_UINT64 std::atomic -#endif - -#ifndef VMA_DEBUG_ALWAYS_DEDICATED_MEMORY - /** - Every allocation will have its own memory block. - Define to 1 for debugging purposes only. - */ - #define VMA_DEBUG_ALWAYS_DEDICATED_MEMORY (0) -#endif - -#ifndef VMA_MIN_ALIGNMENT - /** - Minimum alignment of all allocations, in bytes. - Set to more than 1 for debugging purposes. Must be power of two. - */ - #ifdef VMA_DEBUG_ALIGNMENT // Old name - #define VMA_MIN_ALIGNMENT VMA_DEBUG_ALIGNMENT - #else - #define VMA_MIN_ALIGNMENT (1) - #endif -#endif - -#ifndef VMA_DEBUG_MARGIN - /** - Minimum margin after every allocation, in bytes. - Set nonzero for debugging purposes only. - */ - #define VMA_DEBUG_MARGIN (0) -#endif - -#ifndef VMA_DEBUG_INITIALIZE_ALLOCATIONS - /** - Define this macro to 1 to automatically fill new allocations and destroyed - allocations with some bit pattern. - */ - #define VMA_DEBUG_INITIALIZE_ALLOCATIONS (0) -#endif - -#ifndef VMA_DEBUG_DETECT_CORRUPTION - /** - Define this macro to 1 together with non-zero value of VMA_DEBUG_MARGIN to - enable writing magic value to the margin after every allocation and - validating it, so that memory corruptions (out-of-bounds writes) are detected. - */ - #define VMA_DEBUG_DETECT_CORRUPTION (0) -#endif - -#ifndef VMA_DEBUG_GLOBAL_MUTEX - /** - Set this to 1 for debugging purposes only, to enable single mutex protecting all - entry calls to the library. Can be useful for debugging multithreading issues. - */ - #define VMA_DEBUG_GLOBAL_MUTEX (0) -#endif - -#ifndef VMA_DEBUG_MIN_BUFFER_IMAGE_GRANULARITY - /** - Minimum value for VkPhysicalDeviceLimits::bufferImageGranularity. - Set to more than 1 for debugging purposes only. Must be power of two. - */ - #define VMA_DEBUG_MIN_BUFFER_IMAGE_GRANULARITY (1) -#endif - -#ifndef VMA_DEBUG_DONT_EXCEED_MAX_MEMORY_ALLOCATION_COUNT - /* - Set this to 1 to make VMA never exceed VkPhysicalDeviceLimits::maxMemoryAllocationCount - and return error instead of leaving up to Vulkan implementation what to do in such cases. - */ - #define VMA_DEBUG_DONT_EXCEED_MAX_MEMORY_ALLOCATION_COUNT (0) -#endif - -#ifndef VMA_SMALL_HEAP_MAX_SIZE - /// Maximum size of a memory heap in Vulkan to consider it "small". 
- #define VMA_SMALL_HEAP_MAX_SIZE (1024ull * 1024 * 1024) -#endif - -#ifndef VMA_DEFAULT_LARGE_HEAP_BLOCK_SIZE - /// Default size of a block allocated as single VkDeviceMemory from a "large" heap. - #define VMA_DEFAULT_LARGE_HEAP_BLOCK_SIZE (256ull * 1024 * 1024) -#endif - -/* -Mapping hysteresis is a logic that launches when vmaMapMemory/vmaUnmapMemory is called -or a persistently mapped allocation is created and destroyed several times in a row. -It keeps additional +1 mapping of a device memory block to prevent calling actual -vkMapMemory/vkUnmapMemory too many times, which may improve performance and help -tools like RenderDOc. -*/ -#ifndef VMA_MAPPING_HYSTERESIS_ENABLED - #define VMA_MAPPING_HYSTERESIS_ENABLED 1 -#endif - -#ifndef VMA_CLASS_NO_COPY - #define VMA_CLASS_NO_COPY(className) \ - private: \ - className(const className&) = delete; \ - className& operator=(const className&) = delete; -#endif - -#define VMA_VALIDATE(cond) do { if(!(cond)) { \ - VMA_ASSERT(0 && "Validation failed: " #cond); \ - return false; \ - } } while(false) - -/******************************************************************************* -END OF CONFIGURATION -*/ -#endif // _VMA_CONFIGURATION - - -static const uint8_t VMA_ALLOCATION_FILL_PATTERN_CREATED = 0xDC; -static const uint8_t VMA_ALLOCATION_FILL_PATTERN_DESTROYED = 0xEF; -// Decimal 2139416166, float NaN, little-endian binary 66 E6 84 7F. -static const uint32_t VMA_CORRUPTION_DETECTION_MAGIC_VALUE = 0x7F84E666; - -// Copy of some Vulkan definitions so we don't need to check their existence just to handle few constants. -static const uint32_t VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD_COPY = 0x00000040; -static const uint32_t VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD_COPY = 0x00000080; -static const uint32_t VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_COPY = 0x00020000; -static const uint32_t VK_IMAGE_CREATE_DISJOINT_BIT_COPY = 0x00000200; -static const int32_t VK_IMAGE_TILING_DRM_FORMAT_MODIFIER_EXT_COPY = 1000158000; -static const uint32_t VMA_ALLOCATION_INTERNAL_STRATEGY_MIN_OFFSET = 0x10000000u; -static const uint32_t VMA_ALLOCATION_TRY_COUNT = 32; -static const uint32_t VMA_VENDOR_ID_AMD = 4098; - -// This one is tricky. Vulkan specification defines this code as available since -// Vulkan 1.0, but doesn't actually define it in Vulkan SDK earlier than 1.2.131. -// See pull request #207. -#define VK_ERROR_UNKNOWN_COPY ((VkResult)-13) - - -#if VMA_STATS_STRING_ENABLED -// Correspond to values of enum VmaSuballocationType. -static const char* VMA_SUBALLOCATION_TYPE_NAMES[] = -{ - "FREE", - "UNKNOWN", - "BUFFER", - "IMAGE_UNKNOWN", - "IMAGE_LINEAR", - "IMAGE_OPTIMAL", -}; -#endif - -static VkAllocationCallbacks VmaEmptyAllocationCallbacks = - { VMA_NULL, VMA_NULL, VMA_NULL, VMA_NULL, VMA_NULL, VMA_NULL }; - - -#ifndef _VMA_ENUM_DECLARATIONS - -enum VmaSuballocationType -{ - VMA_SUBALLOCATION_TYPE_FREE = 0, - VMA_SUBALLOCATION_TYPE_UNKNOWN = 1, - VMA_SUBALLOCATION_TYPE_BUFFER = 2, - VMA_SUBALLOCATION_TYPE_IMAGE_UNKNOWN = 3, - VMA_SUBALLOCATION_TYPE_IMAGE_LINEAR = 4, - VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL = 5, - VMA_SUBALLOCATION_TYPE_MAX_ENUM = 0x7FFFFFFF -}; - -enum VMA_CACHE_OPERATION -{ - VMA_CACHE_FLUSH, - VMA_CACHE_INVALIDATE -}; - -enum class VmaAllocationRequestType -{ - Normal, - TLSF, - // Used by "Linear" algorithm. - UpperAddress, - EndOf1st, - EndOf2nd, -}; - -#endif // _VMA_ENUM_DECLARATIONS - -#ifndef _VMA_FORWARD_DECLARATIONS -// Opaque handle used by allocation algorithms to identify single allocation in any conforming way. 
-VK_DEFINE_NON_DISPATCHABLE_HANDLE(VmaAllocHandle); - -struct VmaMutexLock; -struct VmaMutexLockRead; -struct VmaMutexLockWrite; - -template -struct AtomicTransactionalIncrement; - -template -struct VmaStlAllocator; - -template -class VmaVector; - -template -class VmaSmallVector; - -template -class VmaPoolAllocator; - -template -struct VmaListItem; - -template -class VmaRawList; - -template -class VmaList; - -template -class VmaIntrusiveLinkedList; - -// Unused in this version -#if 0 -template -struct VmaPair; -template -struct VmaPairFirstLess; - -template -class VmaMap; -#endif - -#if VMA_STATS_STRING_ENABLED -class VmaStringBuilder; -class VmaJsonWriter; -#endif - -class VmaDeviceMemoryBlock; - -struct VmaDedicatedAllocationListItemTraits; -class VmaDedicatedAllocationList; - -struct VmaSuballocation; -struct VmaSuballocationOffsetLess; -struct VmaSuballocationOffsetGreater; -struct VmaSuballocationItemSizeLess; - -typedef VmaList> VmaSuballocationList; - -struct VmaAllocationRequest; - -class VmaBlockMetadata; -class VmaBlockMetadata_Linear; -class VmaBlockMetadata_TLSF; - -class VmaBlockVector; - -struct VmaPoolListItemTraits; - -struct VmaCurrentBudgetData; - -class VmaAllocationObjectAllocator; - -#endif // _VMA_FORWARD_DECLARATIONS - - -#ifndef _VMA_FUNCTIONS - -/* -Returns number of bits set to 1 in (v). - -On specific platforms and compilers you can use instrinsics like: - -Visual Studio: - return __popcnt(v); -GCC, Clang: - return static_cast(__builtin_popcount(v)); - -Define macro VMA_COUNT_BITS_SET to provide your optimized implementation. -But you need to check in runtime whether user's CPU supports these, as some old processors don't. -*/ -static inline uint32_t VmaCountBitsSet(uint32_t v) -{ -#if __cplusplus >= 202002L || _MSVC_LANG >= 202002L // C++20 - return std::popcount(v); -#else - uint32_t c = v - ((v >> 1) & 0x55555555); - c = ((c >> 2) & 0x33333333) + (c & 0x33333333); - c = ((c >> 4) + c) & 0x0F0F0F0F; - c = ((c >> 8) + c) & 0x00FF00FF; - c = ((c >> 16) + c) & 0x0000FFFF; - return c; -#endif -} - -static inline uint8_t VmaBitScanLSB(uint64_t mask) -{ -#if defined(_MSC_VER) && defined(_WIN64) - unsigned long pos; - if (_BitScanForward64(&pos, mask)) - return static_cast(pos); - return UINT8_MAX; -#elif defined __GNUC__ || defined __clang__ - return static_cast(__builtin_ffsll(mask)) - 1U; -#else - uint8_t pos = 0; - uint64_t bit = 1; - do - { - if (mask & bit) - return pos; - bit <<= 1; - } while (pos++ < 63); - return UINT8_MAX; -#endif -} - -static inline uint8_t VmaBitScanLSB(uint32_t mask) -{ -#ifdef _MSC_VER - unsigned long pos; - if (_BitScanForward(&pos, mask)) - return static_cast(pos); - return UINT8_MAX; -#elif defined __GNUC__ || defined __clang__ - return static_cast(__builtin_ffs(mask)) - 1U; -#else - uint8_t pos = 0; - uint32_t bit = 1; - do - { - if (mask & bit) - return pos; - bit <<= 1; - } while (pos++ < 31); - return UINT8_MAX; -#endif -} - -static inline uint8_t VmaBitScanMSB(uint64_t mask) -{ -#if defined(_MSC_VER) && defined(_WIN64) - unsigned long pos; - if (_BitScanReverse64(&pos, mask)) - return static_cast(pos); -#elif defined __GNUC__ || defined __clang__ - if (mask) - return 63 - static_cast(__builtin_clzll(mask)); -#else - uint8_t pos = 63; - uint64_t bit = 1ULL << 63; - do - { - if (mask & bit) - return pos; - bit >>= 1; - } while (pos-- > 0); -#endif - return UINT8_MAX; -} - -static inline uint8_t VmaBitScanMSB(uint32_t mask) -{ -#ifdef _MSC_VER - unsigned long pos; - if (_BitScanReverse(&pos, mask)) - return static_cast(pos); -#elif 
defined __GNUC__ || defined __clang__ - if (mask) - return 31 - static_cast(__builtin_clz(mask)); -#else - uint8_t pos = 31; - uint32_t bit = 1UL << 31; - do - { - if (mask & bit) - return pos; - bit >>= 1; - } while (pos-- > 0); -#endif - return UINT8_MAX; -} - -/* -Returns true if given number is a power of two. -T must be unsigned integer number or signed integer but always nonnegative. -For 0 returns true. -*/ -template -inline bool VmaIsPow2(T x) -{ - return (x & (x - 1)) == 0; -} - -// Aligns given value up to nearest multiply of align value. For example: VmaAlignUp(11, 8) = 16. -// Use types like uint32_t, uint64_t as T. -template -static inline T VmaAlignUp(T val, T alignment) -{ - VMA_HEAVY_ASSERT(VmaIsPow2(alignment)); - return (val + alignment - 1) & ~(alignment - 1); -} - -// Aligns given value down to nearest multiply of align value. For example: VmaAlignUp(11, 8) = 8. -// Use types like uint32_t, uint64_t as T. -template -static inline T VmaAlignDown(T val, T alignment) -{ - VMA_HEAVY_ASSERT(VmaIsPow2(alignment)); - return val & ~(alignment - 1); -} - -// Division with mathematical rounding to nearest number. -template -static inline T VmaRoundDiv(T x, T y) -{ - return (x + (y / (T)2)) / y; -} - -// Divide by 'y' and round up to nearest integer. -template -static inline T VmaDivideRoundingUp(T x, T y) -{ - return (x + y - (T)1) / y; -} - -// Returns smallest power of 2 greater or equal to v. -static inline uint32_t VmaNextPow2(uint32_t v) -{ - v--; - v |= v >> 1; - v |= v >> 2; - v |= v >> 4; - v |= v >> 8; - v |= v >> 16; - v++; - return v; -} - -static inline uint64_t VmaNextPow2(uint64_t v) -{ - v--; - v |= v >> 1; - v |= v >> 2; - v |= v >> 4; - v |= v >> 8; - v |= v >> 16; - v |= v >> 32; - v++; - return v; -} - -// Returns largest power of 2 less or equal to v. -static inline uint32_t VmaPrevPow2(uint32_t v) -{ - v |= v >> 1; - v |= v >> 2; - v |= v >> 4; - v |= v >> 8; - v |= v >> 16; - v = v ^ (v >> 1); - return v; -} - -static inline uint64_t VmaPrevPow2(uint64_t v) -{ - v |= v >> 1; - v |= v >> 2; - v |= v >> 4; - v |= v >> 8; - v |= v >> 16; - v |= v >> 32; - v = v ^ (v >> 1); - return v; -} - -static inline bool VmaStrIsEmpty(const char* pStr) -{ - return pStr == VMA_NULL || *pStr == '\0'; -} - -/* -Returns true if two memory blocks occupy overlapping pages. -ResourceA must be in less memory offset than ResourceB. - -Algorithm is based on "Vulkan 1.0.39 - A Specification (with all registered Vulkan extensions)" -chapter 11.6 "Resource Memory Association", paragraph "Buffer-Image Granularity". -*/ -static inline bool VmaBlocksOnSamePage( - VkDeviceSize resourceAOffset, - VkDeviceSize resourceASize, - VkDeviceSize resourceBOffset, - VkDeviceSize pageSize) -{ - VMA_ASSERT(resourceAOffset + resourceASize <= resourceBOffset && resourceASize > 0 && pageSize > 0); - VkDeviceSize resourceAEnd = resourceAOffset + resourceASize - 1; - VkDeviceSize resourceAEndPage = resourceAEnd & ~(pageSize - 1); - VkDeviceSize resourceBStart = resourceBOffset; - VkDeviceSize resourceBStartPage = resourceBStart & ~(pageSize - 1); - return resourceAEndPage == resourceBStartPage; -} - -/* -Returns true if given suballocation types could conflict and must respect -VkPhysicalDeviceLimits::bufferImageGranularity. They conflict if one is buffer -or linear image and another one is optimal image. If type is unknown, behave -conservatively. 
-*/ -static inline bool VmaIsBufferImageGranularityConflict( - VmaSuballocationType suballocType1, - VmaSuballocationType suballocType2) -{ - if (suballocType1 > suballocType2) - { - VMA_SWAP(suballocType1, suballocType2); - } - - switch (suballocType1) - { - case VMA_SUBALLOCATION_TYPE_FREE: - return false; - case VMA_SUBALLOCATION_TYPE_UNKNOWN: - return true; - case VMA_SUBALLOCATION_TYPE_BUFFER: - return - suballocType2 == VMA_SUBALLOCATION_TYPE_IMAGE_UNKNOWN || - suballocType2 == VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL; - case VMA_SUBALLOCATION_TYPE_IMAGE_UNKNOWN: - return - suballocType2 == VMA_SUBALLOCATION_TYPE_IMAGE_UNKNOWN || - suballocType2 == VMA_SUBALLOCATION_TYPE_IMAGE_LINEAR || - suballocType2 == VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL; - case VMA_SUBALLOCATION_TYPE_IMAGE_LINEAR: - return - suballocType2 == VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL; - case VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL: - return false; - default: - VMA_ASSERT(0); - return true; - } -} - -static void VmaWriteMagicValue(void* pData, VkDeviceSize offset) -{ -#if VMA_DEBUG_MARGIN > 0 && VMA_DEBUG_DETECT_CORRUPTION - uint32_t* pDst = (uint32_t*)((char*)pData + offset); - const size_t numberCount = VMA_DEBUG_MARGIN / sizeof(uint32_t); - for (size_t i = 0; i < numberCount; ++i, ++pDst) - { - *pDst = VMA_CORRUPTION_DETECTION_MAGIC_VALUE; - } -#else - // no-op -#endif -} - -static bool VmaValidateMagicValue(const void* pData, VkDeviceSize offset) -{ -#if VMA_DEBUG_MARGIN > 0 && VMA_DEBUG_DETECT_CORRUPTION - const uint32_t* pSrc = (const uint32_t*)((const char*)pData + offset); - const size_t numberCount = VMA_DEBUG_MARGIN / sizeof(uint32_t); - for (size_t i = 0; i < numberCount; ++i, ++pSrc) - { - if (*pSrc != VMA_CORRUPTION_DETECTION_MAGIC_VALUE) - { - return false; - } - } -#endif - return true; -} - -/* -Fills structure with parameters of an example buffer to be used for transfers -during GPU memory defragmentation. -*/ -static void VmaFillGpuDefragmentationBufferCreateInfo(VkBufferCreateInfo& outBufCreateInfo) -{ - memset(&outBufCreateInfo, 0, sizeof(outBufCreateInfo)); - outBufCreateInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO; - outBufCreateInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT; - outBufCreateInfo.size = (VkDeviceSize)VMA_DEFAULT_LARGE_HEAP_BLOCK_SIZE; // Example size. -} - - -/* -Performs binary search and returns iterator to first element that is greater or -equal to (key), according to comparison (cmp). - -Cmp should return true if first argument is less than second argument. - -Returned value is the found element, if present in the collection or place where -new element with value (key) should be inserted. -*/ -template -static IterT VmaBinaryFindFirstNotLess(IterT beg, IterT end, const KeyT& key, const CmpLess& cmp) -{ - size_t down = 0, up = (end - beg); - while (down < up) - { - const size_t mid = down + (up - down) / 2; // Overflow-safe midpoint calculation - if (cmp(*(beg + mid), key)) - { - down = mid + 1; - } - else - { - up = mid; - } - } - return beg + down; -} - -template -IterT VmaBinaryFindSorted(const IterT& beg, const IterT& end, const KeyT& value, const CmpLess& cmp) -{ - IterT it = VmaBinaryFindFirstNotLess( - beg, end, value, cmp); - if (it == end || - (!cmp(*it, value) && !cmp(value, *it))) - { - return it; - } - return end; -} - -/* -Returns true if all pointers in the array are not-null and unique. -Warning! O(n^2) complexity. Use only inside VMA_HEAVY_ASSERT. -T must be pointer type, e.g. VmaAllocation, VmaPool. 
-*/ -template -static bool VmaValidatePointerArray(uint32_t count, const T* arr) -{ - for (uint32_t i = 0; i < count; ++i) - { - const T iPtr = arr[i]; - if (iPtr == VMA_NULL) - { - return false; - } - for (uint32_t j = i + 1; j < count; ++j) - { - if (iPtr == arr[j]) - { - return false; - } - } - } - return true; -} - -template -static inline void VmaPnextChainPushFront(MainT* mainStruct, NewT* newStruct) -{ - newStruct->pNext = mainStruct->pNext; - mainStruct->pNext = newStruct; -} - -// This is the main algorithm that guides the selection of a memory type best for an allocation - -// converts usage to required/preferred/not preferred flags. -static bool FindMemoryPreferences( - bool isIntegratedGPU, - const VmaAllocationCreateInfo& allocCreateInfo, - VkFlags bufImgUsage, // VkBufferCreateInfo::usage or VkImageCreateInfo::usage. UINT32_MAX if unknown. - VkMemoryPropertyFlags& outRequiredFlags, - VkMemoryPropertyFlags& outPreferredFlags, - VkMemoryPropertyFlags& outNotPreferredFlags) -{ - outRequiredFlags = allocCreateInfo.requiredFlags; - outPreferredFlags = allocCreateInfo.preferredFlags; - outNotPreferredFlags = 0; - - switch(allocCreateInfo.usage) - { - case VMA_MEMORY_USAGE_UNKNOWN: - break; - case VMA_MEMORY_USAGE_GPU_ONLY: - if(!isIntegratedGPU || (outPreferredFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) == 0) - { - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - } - break; - case VMA_MEMORY_USAGE_CPU_ONLY: - outRequiredFlags |= VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT; - break; - case VMA_MEMORY_USAGE_CPU_TO_GPU: - outRequiredFlags |= VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT; - if(!isIntegratedGPU || (outPreferredFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) == 0) - { - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - } - break; - case VMA_MEMORY_USAGE_GPU_TO_CPU: - outRequiredFlags |= VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT; - outPreferredFlags |= VK_MEMORY_PROPERTY_HOST_CACHED_BIT; - break; - case VMA_MEMORY_USAGE_CPU_COPY: - outNotPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - break; - case VMA_MEMORY_USAGE_GPU_LAZILY_ALLOCATED: - outRequiredFlags |= VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT; - break; - case VMA_MEMORY_USAGE_AUTO: - case VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE: - case VMA_MEMORY_USAGE_AUTO_PREFER_HOST: - { - if(bufImgUsage == UINT32_MAX) - { - VMA_ASSERT(0 && "VMA_MEMORY_USAGE_AUTO* values can only be used with functions like vmaCreateBuffer, vmaCreateImage so that the details of the created resource are known."); - return false; - } - // This relies on values of VK_IMAGE_USAGE_TRANSFER* being the same VK_BUFFER_IMAGE_TRANSFER*. - const bool deviceAccess = (bufImgUsage & ~(VK_BUFFER_USAGE_TRANSFER_DST_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT)) != 0; - const bool hostAccessSequentialWrite = (allocCreateInfo.flags & VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT) != 0; - const bool hostAccessRandom = (allocCreateInfo.flags & VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT) != 0; - const bool hostAccessAllowTransferInstead = (allocCreateInfo.flags & VMA_ALLOCATION_CREATE_HOST_ACCESS_ALLOW_TRANSFER_INSTEAD_BIT) != 0; - const bool preferDevice = allocCreateInfo.usage == VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE; - const bool preferHost = allocCreateInfo.usage == VMA_MEMORY_USAGE_AUTO_PREFER_HOST; - - // CPU random access - e.g. a buffer written to or transferred from GPU to read back on CPU. 
- if(hostAccessRandom) - { - if(!isIntegratedGPU && deviceAccess && hostAccessAllowTransferInstead && !preferHost) - { - // Nice if it will end up in HOST_VISIBLE, but more importantly prefer DEVICE_LOCAL. - // Omitting HOST_VISIBLE here is intentional. - // In case there is DEVICE_LOCAL | HOST_VISIBLE | HOST_CACHED, it will pick that one. - // Otherwise, this will give same weight to DEVICE_LOCAL as HOST_VISIBLE | HOST_CACHED and select the former if occurs first on the list. - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT; - } - else - { - // Always CPU memory, cached. - outRequiredFlags |= VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT; - } - } - // CPU sequential write - may be CPU or host-visible GPU memory, uncached and write-combined. - else if(hostAccessSequentialWrite) - { - // Want uncached and write-combined. - outNotPreferredFlags |= VK_MEMORY_PROPERTY_HOST_CACHED_BIT; - - if(!isIntegratedGPU && deviceAccess && hostAccessAllowTransferInstead && !preferHost) - { - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT; - } - else - { - outRequiredFlags |= VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT; - // Direct GPU access, CPU sequential write (e.g. a dynamic uniform buffer updated every frame) - if(deviceAccess) - { - // Could go to CPU memory or GPU BAR/unified. Up to the user to decide. If no preference, choose GPU memory. - if(preferHost) - outNotPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - else - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - } - // GPU no direct access, CPU sequential write (e.g. an upload buffer to be transferred to the GPU) - else - { - // Could go to CPU memory or GPU BAR/unified. Up to the user to decide. If no preference, choose CPU memory. - if(preferDevice) - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - else - outNotPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - } - } - } - // No CPU access - else - { - // GPU access, no CPU access (e.g. a color attachment image) - prefer GPU memory - if(deviceAccess) - { - // ...unless there is a clear preference from the user not to do so. - if(preferHost) - outNotPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - else - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - } - // No direct GPU access, no CPU access, just transfers. - // It may be staging copy intended for e.g. preserving image for next frame (then better GPU memory) or - // a "swap file" copy to free some GPU memory (then better CPU memory). - // Up to the user to decide. If no preferece, assume the former and choose GPU memory. - if(preferHost) - outNotPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - else - outPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT; - } - break; - } - default: - VMA_ASSERT(0); - } - - // Avoid DEVICE_COHERENT unless explicitly requested. 
- if(((allocCreateInfo.requiredFlags | allocCreateInfo.preferredFlags) & - (VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD_COPY | VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD_COPY)) == 0) - { - outNotPreferredFlags |= VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD_COPY; - } - - return true; -} - -//////////////////////////////////////////////////////////////////////////////// -// Memory allocation - -static void* VmaMalloc(const VkAllocationCallbacks* pAllocationCallbacks, size_t size, size_t alignment) -{ - void* result = VMA_NULL; - if ((pAllocationCallbacks != VMA_NULL) && - (pAllocationCallbacks->pfnAllocation != VMA_NULL)) - { - result = (*pAllocationCallbacks->pfnAllocation)( - pAllocationCallbacks->pUserData, - size, - alignment, - VK_SYSTEM_ALLOCATION_SCOPE_OBJECT); - } - else - { - result = VMA_SYSTEM_ALIGNED_MALLOC(size, alignment); - } - VMA_ASSERT(result != VMA_NULL && "CPU memory allocation failed."); - return result; -} - -static void VmaFree(const VkAllocationCallbacks* pAllocationCallbacks, void* ptr) -{ - if ((pAllocationCallbacks != VMA_NULL) && - (pAllocationCallbacks->pfnFree != VMA_NULL)) - { - (*pAllocationCallbacks->pfnFree)(pAllocationCallbacks->pUserData, ptr); - } - else - { - VMA_SYSTEM_ALIGNED_FREE(ptr); - } -} - -template -static T* VmaAllocate(const VkAllocationCallbacks* pAllocationCallbacks) -{ - return (T*)VmaMalloc(pAllocationCallbacks, sizeof(T), VMA_ALIGN_OF(T)); -} - -template -static T* VmaAllocateArray(const VkAllocationCallbacks* pAllocationCallbacks, size_t count) -{ - return (T*)VmaMalloc(pAllocationCallbacks, sizeof(T) * count, VMA_ALIGN_OF(T)); -} - -#define vma_new(allocator, type) new(VmaAllocate(allocator))(type) - -#define vma_new_array(allocator, type, count) new(VmaAllocateArray((allocator), (count)))(type) - -template -static void vma_delete(const VkAllocationCallbacks* pAllocationCallbacks, T* ptr) -{ - ptr->~T(); - VmaFree(pAllocationCallbacks, ptr); -} - -template -static void vma_delete_array(const VkAllocationCallbacks* pAllocationCallbacks, T* ptr, size_t count) -{ - if (ptr != VMA_NULL) - { - for (size_t i = count; i--; ) - { - ptr[i].~T(); - } - VmaFree(pAllocationCallbacks, ptr); - } -} - -static char* VmaCreateStringCopy(const VkAllocationCallbacks* allocs, const char* srcStr) -{ - if (srcStr != VMA_NULL) - { - const size_t len = strlen(srcStr); - char* const result = vma_new_array(allocs, char, len + 1); - memcpy(result, srcStr, len + 1); - return result; - } - return VMA_NULL; -} - -#if VMA_STATS_STRING_ENABLED -static char* VmaCreateStringCopy(const VkAllocationCallbacks* allocs, const char* srcStr, size_t strLen) -{ - if (srcStr != VMA_NULL) - { - char* const result = vma_new_array(allocs, char, strLen + 1); - memcpy(result, srcStr, strLen); - result[strLen] = '\0'; - return result; - } - return VMA_NULL; -} -#endif // VMA_STATS_STRING_ENABLED - -static void VmaFreeString(const VkAllocationCallbacks* allocs, char* str) -{ - if (str != VMA_NULL) - { - const size_t len = strlen(str); - vma_delete_array(allocs, str, len + 1); - } -} - -template -size_t VmaVectorInsertSorted(VectorT& vector, const typename VectorT::value_type& value) -{ - const size_t indexToInsert = VmaBinaryFindFirstNotLess( - vector.data(), - vector.data() + vector.size(), - value, - CmpLess()) - vector.data(); - VmaVectorInsert(vector, indexToInsert, value); - return indexToInsert; -} - -template -bool VmaVectorRemoveSorted(VectorT& vector, const typename VectorT::value_type& value) -{ - CmpLess comparator; - typename VectorT::iterator it = VmaBinaryFindFirstNotLess( - 
vector.begin(), - vector.end(), - value, - comparator); - if ((it != vector.end()) && !comparator(*it, value) && !comparator(value, *it)) - { - size_t indexToRemove = it - vector.begin(); - VmaVectorRemove(vector, indexToRemove); - return true; - } - return false; -} -#endif // _VMA_FUNCTIONS - -#ifndef _VMA_STATISTICS_FUNCTIONS - -static void VmaClearStatistics(VmaStatistics& outStats) -{ - outStats.blockCount = 0; - outStats.allocationCount = 0; - outStats.blockBytes = 0; - outStats.allocationBytes = 0; -} - -static void VmaAddStatistics(VmaStatistics& inoutStats, const VmaStatistics& src) -{ - inoutStats.blockCount += src.blockCount; - inoutStats.allocationCount += src.allocationCount; - inoutStats.blockBytes += src.blockBytes; - inoutStats.allocationBytes += src.allocationBytes; -} - -static void VmaClearDetailedStatistics(VmaDetailedStatistics& outStats) -{ - VmaClearStatistics(outStats.statistics); - outStats.unusedRangeCount = 0; - outStats.allocationSizeMin = VK_WHOLE_SIZE; - outStats.allocationSizeMax = 0; - outStats.unusedRangeSizeMin = VK_WHOLE_SIZE; - outStats.unusedRangeSizeMax = 0; -} - -static void VmaAddDetailedStatisticsAllocation(VmaDetailedStatistics& inoutStats, VkDeviceSize size) -{ - inoutStats.statistics.allocationCount++; - inoutStats.statistics.allocationBytes += size; - inoutStats.allocationSizeMin = VMA_MIN(inoutStats.allocationSizeMin, size); - inoutStats.allocationSizeMax = VMA_MAX(inoutStats.allocationSizeMax, size); -} - -static void VmaAddDetailedStatisticsUnusedRange(VmaDetailedStatistics& inoutStats, VkDeviceSize size) -{ - inoutStats.unusedRangeCount++; - inoutStats.unusedRangeSizeMin = VMA_MIN(inoutStats.unusedRangeSizeMin, size); - inoutStats.unusedRangeSizeMax = VMA_MAX(inoutStats.unusedRangeSizeMax, size); -} - -static void VmaAddDetailedStatistics(VmaDetailedStatistics& inoutStats, const VmaDetailedStatistics& src) -{ - VmaAddStatistics(inoutStats.statistics, src.statistics); - inoutStats.unusedRangeCount += src.unusedRangeCount; - inoutStats.allocationSizeMin = VMA_MIN(inoutStats.allocationSizeMin, src.allocationSizeMin); - inoutStats.allocationSizeMax = VMA_MAX(inoutStats.allocationSizeMax, src.allocationSizeMax); - inoutStats.unusedRangeSizeMin = VMA_MIN(inoutStats.unusedRangeSizeMin, src.unusedRangeSizeMin); - inoutStats.unusedRangeSizeMax = VMA_MAX(inoutStats.unusedRangeSizeMax, src.unusedRangeSizeMax); -} - -#endif // _VMA_STATISTICS_FUNCTIONS - -#ifndef _VMA_MUTEX_LOCK -// Helper RAII class to lock a mutex in constructor and unlock it in destructor (at the end of scope). -struct VmaMutexLock -{ - VMA_CLASS_NO_COPY(VmaMutexLock) -public: - VmaMutexLock(VMA_MUTEX& mutex, bool useMutex = true) : - m_pMutex(useMutex ? &mutex : VMA_NULL) - { - if (m_pMutex) { m_pMutex->Lock(); } - } - ~VmaMutexLock() { if (m_pMutex) { m_pMutex->Unlock(); } } - -private: - VMA_MUTEX* m_pMutex; -}; - -// Helper RAII class to lock a RW mutex in constructor and unlock it in destructor (at the end of scope), for reading. -struct VmaMutexLockRead -{ - VMA_CLASS_NO_COPY(VmaMutexLockRead) -public: - VmaMutexLockRead(VMA_RW_MUTEX& mutex, bool useMutex) : - m_pMutex(useMutex ? &mutex : VMA_NULL) - { - if (m_pMutex) { m_pMutex->LockRead(); } - } - ~VmaMutexLockRead() { if (m_pMutex) { m_pMutex->UnlockRead(); } } - -private: - VMA_RW_MUTEX* m_pMutex; -}; - -// Helper RAII class to lock a RW mutex in constructor and unlock it in destructor (at the end of scope), for writing. 
-struct VmaMutexLockWrite -{ - VMA_CLASS_NO_COPY(VmaMutexLockWrite) -public: - VmaMutexLockWrite(VMA_RW_MUTEX& mutex, bool useMutex) - : m_pMutex(useMutex ? &mutex : VMA_NULL) - { - if (m_pMutex) { m_pMutex->LockWrite(); } - } - ~VmaMutexLockWrite() { if (m_pMutex) { m_pMutex->UnlockWrite(); } } - -private: - VMA_RW_MUTEX* m_pMutex; -}; - -#if VMA_DEBUG_GLOBAL_MUTEX - static VMA_MUTEX gDebugGlobalMutex; - #define VMA_DEBUG_GLOBAL_MUTEX_LOCK VmaMutexLock debugGlobalMutexLock(gDebugGlobalMutex, true); -#else - #define VMA_DEBUG_GLOBAL_MUTEX_LOCK -#endif -#endif // _VMA_MUTEX_LOCK - -#ifndef _VMA_ATOMIC_TRANSACTIONAL_INCREMENT -// An object that increments given atomic but decrements it back in the destructor unless Commit() is called. -template -struct AtomicTransactionalIncrement -{ -public: - typedef std::atomic AtomicT; - - ~AtomicTransactionalIncrement() - { - if(m_Atomic) - --(*m_Atomic); - } - - void Commit() { m_Atomic = nullptr; } - T Increment(AtomicT* atomic) - { - m_Atomic = atomic; - return m_Atomic->fetch_add(1); - } - -private: - AtomicT* m_Atomic = nullptr; -}; -#endif // _VMA_ATOMIC_TRANSACTIONAL_INCREMENT - -#ifndef _VMA_STL_ALLOCATOR -// STL-compatible allocator. -template -struct VmaStlAllocator -{ - const VkAllocationCallbacks* const m_pCallbacks; - typedef T value_type; - - VmaStlAllocator(const VkAllocationCallbacks* pCallbacks) : m_pCallbacks(pCallbacks) {} - template - VmaStlAllocator(const VmaStlAllocator& src) : m_pCallbacks(src.m_pCallbacks) {} - VmaStlAllocator(const VmaStlAllocator&) = default; - VmaStlAllocator& operator=(const VmaStlAllocator&) = delete; - - T* allocate(size_t n) { return VmaAllocateArray(m_pCallbacks, n); } - void deallocate(T* p, size_t n) { VmaFree(m_pCallbacks, p); } - - template - bool operator==(const VmaStlAllocator& rhs) const - { - return m_pCallbacks == rhs.m_pCallbacks; - } - template - bool operator!=(const VmaStlAllocator& rhs) const - { - return m_pCallbacks != rhs.m_pCallbacks; - } -}; -#endif // _VMA_STL_ALLOCATOR - -#ifndef _VMA_VECTOR -/* Class with interface compatible with subset of std::vector. -T must be POD because constructors and destructors are not called and memcpy is -used for these objects. */ -template -class VmaVector -{ -public: - typedef T value_type; - typedef T* iterator; - typedef const T* const_iterator; - - VmaVector(const AllocatorT& allocator); - VmaVector(size_t count, const AllocatorT& allocator); - // This version of the constructor is here for compatibility with pre-C++14 std::vector. - // value is unused. 
- VmaVector(size_t count, const T& value, const AllocatorT& allocator) : VmaVector(count, allocator) {} - VmaVector(const VmaVector& src); - VmaVector& operator=(const VmaVector& rhs); - ~VmaVector() { VmaFree(m_Allocator.m_pCallbacks, m_pArray); } - - bool empty() const { return m_Count == 0; } - size_t size() const { return m_Count; } - T* data() { return m_pArray; } - T& front() { VMA_HEAVY_ASSERT(m_Count > 0); return m_pArray[0]; } - T& back() { VMA_HEAVY_ASSERT(m_Count > 0); return m_pArray[m_Count - 1]; } - const T* data() const { return m_pArray; } - const T& front() const { VMA_HEAVY_ASSERT(m_Count > 0); return m_pArray[0]; } - const T& back() const { VMA_HEAVY_ASSERT(m_Count > 0); return m_pArray[m_Count - 1]; } - - iterator begin() { return m_pArray; } - iterator end() { return m_pArray + m_Count; } - const_iterator cbegin() const { return m_pArray; } - const_iterator cend() const { return m_pArray + m_Count; } - const_iterator begin() const { return cbegin(); } - const_iterator end() const { return cend(); } - - void pop_front() { VMA_HEAVY_ASSERT(m_Count > 0); remove(0); } - void pop_back() { VMA_HEAVY_ASSERT(m_Count > 0); resize(size() - 1); } - void push_front(const T& src) { insert(0, src); } - - void push_back(const T& src); - void reserve(size_t newCapacity, bool freeMemory = false); - void resize(size_t newCount); - void clear() { resize(0); } - void shrink_to_fit(); - void insert(size_t index, const T& src); - void remove(size_t index); - - T& operator[](size_t index) { VMA_HEAVY_ASSERT(index < m_Count); return m_pArray[index]; } - const T& operator[](size_t index) const { VMA_HEAVY_ASSERT(index < m_Count); return m_pArray[index]; } - -private: - AllocatorT m_Allocator; - T* m_pArray; - size_t m_Count; - size_t m_Capacity; -}; - -#ifndef _VMA_VECTOR_FUNCTIONS -template -VmaVector::VmaVector(const AllocatorT& allocator) - : m_Allocator(allocator), - m_pArray(VMA_NULL), - m_Count(0), - m_Capacity(0) {} - -template -VmaVector::VmaVector(size_t count, const AllocatorT& allocator) - : m_Allocator(allocator), - m_pArray(count ? (T*)VmaAllocateArray(allocator.m_pCallbacks, count) : VMA_NULL), - m_Count(count), - m_Capacity(count) {} - -template -VmaVector::VmaVector(const VmaVector& src) - : m_Allocator(src.m_Allocator), - m_pArray(src.m_Count ? (T*)VmaAllocateArray(src.m_Allocator.m_pCallbacks, src.m_Count) : VMA_NULL), - m_Count(src.m_Count), - m_Capacity(src.m_Count) -{ - if (m_Count != 0) - { - memcpy(m_pArray, src.m_pArray, m_Count * sizeof(T)); - } -} - -template -VmaVector& VmaVector::operator=(const VmaVector& rhs) -{ - if (&rhs != this) - { - resize(rhs.m_Count); - if (m_Count != 0) - { - memcpy(m_pArray, rhs.m_pArray, m_Count * sizeof(T)); - } - } - return *this; -} - -template -void VmaVector::push_back(const T& src) -{ - const size_t newIndex = size(); - resize(newIndex + 1); - m_pArray[newIndex] = src; -} - -template -void VmaVector::reserve(size_t newCapacity, bool freeMemory) -{ - newCapacity = VMA_MAX(newCapacity, m_Count); - - if ((newCapacity < m_Capacity) && !freeMemory) - { - newCapacity = m_Capacity; - } - - if (newCapacity != m_Capacity) - { - T* const newArray = newCapacity ? 
VmaAllocateArray(m_Allocator, newCapacity) : VMA_NULL; - if (m_Count != 0) - { - memcpy(newArray, m_pArray, m_Count * sizeof(T)); - } - VmaFree(m_Allocator.m_pCallbacks, m_pArray); - m_Capacity = newCapacity; - m_pArray = newArray; - } -} - -template -void VmaVector::resize(size_t newCount) -{ - size_t newCapacity = m_Capacity; - if (newCount > m_Capacity) - { - newCapacity = VMA_MAX(newCount, VMA_MAX(m_Capacity * 3 / 2, (size_t)8)); - } - - if (newCapacity != m_Capacity) - { - T* const newArray = newCapacity ? VmaAllocateArray(m_Allocator.m_pCallbacks, newCapacity) : VMA_NULL; - const size_t elementsToCopy = VMA_MIN(m_Count, newCount); - if (elementsToCopy != 0) - { - memcpy(newArray, m_pArray, elementsToCopy * sizeof(T)); - } - VmaFree(m_Allocator.m_pCallbacks, m_pArray); - m_Capacity = newCapacity; - m_pArray = newArray; - } - - m_Count = newCount; -} - -template -void VmaVector::shrink_to_fit() -{ - if (m_Capacity > m_Count) - { - T* newArray = VMA_NULL; - if (m_Count > 0) - { - newArray = VmaAllocateArray(m_Allocator.m_pCallbacks, m_Count); - memcpy(newArray, m_pArray, m_Count * sizeof(T)); - } - VmaFree(m_Allocator.m_pCallbacks, m_pArray); - m_Capacity = m_Count; - m_pArray = newArray; - } -} - -template -void VmaVector::insert(size_t index, const T& src) -{ - VMA_HEAVY_ASSERT(index <= m_Count); - const size_t oldCount = size(); - resize(oldCount + 1); - if (index < oldCount) - { - memmove(m_pArray + (index + 1), m_pArray + index, (oldCount - index) * sizeof(T)); - } - m_pArray[index] = src; -} - -template -void VmaVector::remove(size_t index) -{ - VMA_HEAVY_ASSERT(index < m_Count); - const size_t oldCount = size(); - if (index < oldCount - 1) - { - memmove(m_pArray + index, m_pArray + (index + 1), (oldCount - index - 1) * sizeof(T)); - } - resize(oldCount - 1); -} -#endif // _VMA_VECTOR_FUNCTIONS - -template -static void VmaVectorInsert(VmaVector& vec, size_t index, const T& item) -{ - vec.insert(index, item); -} - -template -static void VmaVectorRemove(VmaVector& vec, size_t index) -{ - vec.remove(index); -} -#endif // _VMA_VECTOR - -#ifndef _VMA_SMALL_VECTOR -/* -This is a vector (a variable-sized array), optimized for the case when the array is small. - -It contains some number of elements in-place, which allows it to avoid heap allocation -when the actual number of elements is below that threshold. This allows normal "small" -cases to be fast without losing generality for large inputs. -*/ -template -class VmaSmallVector -{ -public: - typedef T value_type; - typedef T* iterator; - - VmaSmallVector(const AllocatorT& allocator); - VmaSmallVector(size_t count, const AllocatorT& allocator); - template - VmaSmallVector(const VmaSmallVector&) = delete; - template - VmaSmallVector& operator=(const VmaSmallVector&) = delete; - ~VmaSmallVector() = default; - - bool empty() const { return m_Count == 0; } - size_t size() const { return m_Count; } - T* data() { return m_Count > N ? m_DynamicArray.data() : m_StaticArray; } - T& front() { VMA_HEAVY_ASSERT(m_Count > 0); return data()[0]; } - T& back() { VMA_HEAVY_ASSERT(m_Count > 0); return data()[m_Count - 1]; } - const T* data() const { return m_Count > N ? 
m_DynamicArray.data() : m_StaticArray; } - const T& front() const { VMA_HEAVY_ASSERT(m_Count > 0); return data()[0]; } - const T& back() const { VMA_HEAVY_ASSERT(m_Count > 0); return data()[m_Count - 1]; } - - iterator begin() { return data(); } - iterator end() { return data() + m_Count; } - - void pop_front() { VMA_HEAVY_ASSERT(m_Count > 0); remove(0); } - void pop_back() { VMA_HEAVY_ASSERT(m_Count > 0); resize(size() - 1); } - void push_front(const T& src) { insert(0, src); } - - void push_back(const T& src); - void resize(size_t newCount, bool freeMemory = false); - void clear(bool freeMemory = false); - void insert(size_t index, const T& src); - void remove(size_t index); - - T& operator[](size_t index) { VMA_HEAVY_ASSERT(index < m_Count); return data()[index]; } - const T& operator[](size_t index) const { VMA_HEAVY_ASSERT(index < m_Count); return data()[index]; } - -private: - size_t m_Count; - T m_StaticArray[N]; // Used when m_Size <= N - VmaVector m_DynamicArray; // Used when m_Size > N -}; - -#ifndef _VMA_SMALL_VECTOR_FUNCTIONS -template -VmaSmallVector::VmaSmallVector(const AllocatorT& allocator) - : m_Count(0), - m_DynamicArray(allocator) {} - -template -VmaSmallVector::VmaSmallVector(size_t count, const AllocatorT& allocator) - : m_Count(count), - m_DynamicArray(count > N ? count : 0, allocator) {} - -template -void VmaSmallVector::push_back(const T& src) -{ - const size_t newIndex = size(); - resize(newIndex + 1); - data()[newIndex] = src; -} - -template -void VmaSmallVector::resize(size_t newCount, bool freeMemory) -{ - if (newCount > N && m_Count > N) - { - // Any direction, staying in m_DynamicArray - m_DynamicArray.resize(newCount); - if (freeMemory) - { - m_DynamicArray.shrink_to_fit(); - } - } - else if (newCount > N && m_Count <= N) - { - // Growing, moving from m_StaticArray to m_DynamicArray - m_DynamicArray.resize(newCount); - if (m_Count > 0) - { - memcpy(m_DynamicArray.data(), m_StaticArray, m_Count * sizeof(T)); - } - } - else if (newCount <= N && m_Count > N) - { - // Shrinking, moving from m_DynamicArray to m_StaticArray - if (newCount > 0) - { - memcpy(m_StaticArray, m_DynamicArray.data(), newCount * sizeof(T)); - } - m_DynamicArray.resize(0); - if (freeMemory) - { - m_DynamicArray.shrink_to_fit(); - } - } - else - { - // Any direction, staying in m_StaticArray - nothing to do here - } - m_Count = newCount; -} - -template -void VmaSmallVector::clear(bool freeMemory) -{ - m_DynamicArray.clear(); - if (freeMemory) - { - m_DynamicArray.shrink_to_fit(); - } - m_Count = 0; -} - -template -void VmaSmallVector::insert(size_t index, const T& src) -{ - VMA_HEAVY_ASSERT(index <= m_Count); - const size_t oldCount = size(); - resize(oldCount + 1); - T* const dataPtr = data(); - if (index < oldCount) - { - // I know, this could be more optimal for case where memmove can be memcpy directly from m_StaticArray to m_DynamicArray. - memmove(dataPtr + (index + 1), dataPtr + index, (oldCount - index) * sizeof(T)); - } - dataPtr[index] = src; -} - -template -void VmaSmallVector::remove(size_t index) -{ - VMA_HEAVY_ASSERT(index < m_Count); - const size_t oldCount = size(); - if (index < oldCount - 1) - { - // I know, this could be more optimal for case where memmove can be memcpy directly from m_DynamicArray to m_StaticArray. 
- T* const dataPtr = data(); - memmove(dataPtr + index, dataPtr + (index + 1), (oldCount - index - 1) * sizeof(T)); - } - resize(oldCount - 1); -} -#endif // _VMA_SMALL_VECTOR_FUNCTIONS -#endif // _VMA_SMALL_VECTOR - -#ifndef _VMA_POOL_ALLOCATOR -/* -Allocator for objects of type T using a list of arrays (pools) to speed up -allocation. Number of elements that can be allocated is not bounded because -allocator can create multiple blocks. -*/ -template -class VmaPoolAllocator -{ - VMA_CLASS_NO_COPY(VmaPoolAllocator) -public: - VmaPoolAllocator(const VkAllocationCallbacks* pAllocationCallbacks, uint32_t firstBlockCapacity); - ~VmaPoolAllocator(); - template T* Alloc(Types&&... args); - void Free(T* ptr); - -private: - union Item - { - uint32_t NextFreeIndex; - alignas(T) char Value[sizeof(T)]; - }; - struct ItemBlock - { - Item* pItems; - uint32_t Capacity; - uint32_t FirstFreeIndex; - }; - - const VkAllocationCallbacks* m_pAllocationCallbacks; - const uint32_t m_FirstBlockCapacity; - VmaVector> m_ItemBlocks; - - ItemBlock& CreateNewBlock(); -}; - -#ifndef _VMA_POOL_ALLOCATOR_FUNCTIONS -template -VmaPoolAllocator::VmaPoolAllocator(const VkAllocationCallbacks* pAllocationCallbacks, uint32_t firstBlockCapacity) - : m_pAllocationCallbacks(pAllocationCallbacks), - m_FirstBlockCapacity(firstBlockCapacity), - m_ItemBlocks(VmaStlAllocator(pAllocationCallbacks)) -{ - VMA_ASSERT(m_FirstBlockCapacity > 1); -} - -template -VmaPoolAllocator::~VmaPoolAllocator() -{ - for (size_t i = m_ItemBlocks.size(); i--;) - vma_delete_array(m_pAllocationCallbacks, m_ItemBlocks[i].pItems, m_ItemBlocks[i].Capacity); - m_ItemBlocks.clear(); -} - -template -template T* VmaPoolAllocator::Alloc(Types&&... args) -{ - for (size_t i = m_ItemBlocks.size(); i--; ) - { - ItemBlock& block = m_ItemBlocks[i]; - // This block has some free items: Use first one. - if (block.FirstFreeIndex != UINT32_MAX) - { - Item* const pItem = &block.pItems[block.FirstFreeIndex]; - block.FirstFreeIndex = pItem->NextFreeIndex; - T* result = (T*)&pItem->Value; - new(result)T(std::forward(args)...); // Explicit constructor call. - return result; - } - } - - // No block has free item: Create new one and use it. - ItemBlock& newBlock = CreateNewBlock(); - Item* const pItem = &newBlock.pItems[0]; - newBlock.FirstFreeIndex = pItem->NextFreeIndex; - T* result = (T*)&pItem->Value; - new(result) T(std::forward(args)...); // Explicit constructor call. - return result; -} - -template -void VmaPoolAllocator::Free(T* ptr) -{ - // Search all memory blocks to find ptr. - for (size_t i = m_ItemBlocks.size(); i--; ) - { - ItemBlock& block = m_ItemBlocks[i]; - - // Casting to union. - Item* pItemPtr; - memcpy(&pItemPtr, &ptr, sizeof(pItemPtr)); - - // Check if pItemPtr is in address range of this block. - if ((pItemPtr >= block.pItems) && (pItemPtr < block.pItems + block.Capacity)) - { - ptr->~T(); // Explicit destructor call. - const uint32_t index = static_cast(pItemPtr - block.pItems); - pItemPtr->NextFreeIndex = block.FirstFreeIndex; - block.FirstFreeIndex = index; - return; - } - } - VMA_ASSERT(0 && "Pointer doesn't belong to this memory pool."); -} - -template -typename VmaPoolAllocator::ItemBlock& VmaPoolAllocator::CreateNewBlock() -{ - const uint32_t newBlockCapacity = m_ItemBlocks.empty() ? 
- m_FirstBlockCapacity : m_ItemBlocks.back().Capacity * 3 / 2; - - const ItemBlock newBlock = - { - vma_new_array(m_pAllocationCallbacks, Item, newBlockCapacity), - newBlockCapacity, - 0 - }; - - m_ItemBlocks.push_back(newBlock); - - // Setup singly-linked list of all free items in this block. - for (uint32_t i = 0; i < newBlockCapacity - 1; ++i) - newBlock.pItems[i].NextFreeIndex = i + 1; - newBlock.pItems[newBlockCapacity - 1].NextFreeIndex = UINT32_MAX; - return m_ItemBlocks.back(); -} -#endif // _VMA_POOL_ALLOCATOR_FUNCTIONS -#endif // _VMA_POOL_ALLOCATOR - -#ifndef _VMA_RAW_LIST -template -struct VmaListItem -{ - VmaListItem* pPrev; - VmaListItem* pNext; - T Value; -}; - -// Doubly linked list. -template -class VmaRawList -{ - VMA_CLASS_NO_COPY(VmaRawList) -public: - typedef VmaListItem ItemType; - - VmaRawList(const VkAllocationCallbacks* pAllocationCallbacks); - // Intentionally not calling Clear, because that would be unnecessary - // computations to return all items to m_ItemAllocator as free. - ~VmaRawList() = default; - - size_t GetCount() const { return m_Count; } - bool IsEmpty() const { return m_Count == 0; } - - ItemType* Front() { return m_pFront; } - ItemType* Back() { return m_pBack; } - const ItemType* Front() const { return m_pFront; } - const ItemType* Back() const { return m_pBack; } - - ItemType* PushFront(); - ItemType* PushBack(); - ItemType* PushFront(const T& value); - ItemType* PushBack(const T& value); - void PopFront(); - void PopBack(); - - // Item can be null - it means PushBack. - ItemType* InsertBefore(ItemType* pItem); - // Item can be null - it means PushFront. - ItemType* InsertAfter(ItemType* pItem); - ItemType* InsertBefore(ItemType* pItem, const T& value); - ItemType* InsertAfter(ItemType* pItem, const T& value); - - void Clear(); - void Remove(ItemType* pItem); - -private: - const VkAllocationCallbacks* const m_pAllocationCallbacks; - VmaPoolAllocator m_ItemAllocator; - ItemType* m_pFront; - ItemType* m_pBack; - size_t m_Count; -}; - -#ifndef _VMA_RAW_LIST_FUNCTIONS -template -VmaRawList::VmaRawList(const VkAllocationCallbacks* pAllocationCallbacks) - : m_pAllocationCallbacks(pAllocationCallbacks), - m_ItemAllocator(pAllocationCallbacks, 128), - m_pFront(VMA_NULL), - m_pBack(VMA_NULL), - m_Count(0) {} - -template -VmaListItem* VmaRawList::PushFront() -{ - ItemType* const pNewItem = m_ItemAllocator.Alloc(); - pNewItem->pPrev = VMA_NULL; - if (IsEmpty()) - { - pNewItem->pNext = VMA_NULL; - m_pFront = pNewItem; - m_pBack = pNewItem; - m_Count = 1; - } - else - { - pNewItem->pNext = m_pFront; - m_pFront->pPrev = pNewItem; - m_pFront = pNewItem; - ++m_Count; - } - return pNewItem; -} - -template -VmaListItem* VmaRawList::PushBack() -{ - ItemType* const pNewItem = m_ItemAllocator.Alloc(); - pNewItem->pNext = VMA_NULL; - if(IsEmpty()) - { - pNewItem->pPrev = VMA_NULL; - m_pFront = pNewItem; - m_pBack = pNewItem; - m_Count = 1; - } - else - { - pNewItem->pPrev = m_pBack; - m_pBack->pNext = pNewItem; - m_pBack = pNewItem; - ++m_Count; - } - return pNewItem; -} - -template -VmaListItem* VmaRawList::PushFront(const T& value) -{ - ItemType* const pNewItem = PushFront(); - pNewItem->Value = value; - return pNewItem; -} - -template -VmaListItem* VmaRawList::PushBack(const T& value) -{ - ItemType* const pNewItem = PushBack(); - pNewItem->Value = value; - return pNewItem; -} - -template -void VmaRawList::PopFront() -{ - VMA_HEAVY_ASSERT(m_Count > 0); - ItemType* const pFrontItem = m_pFront; - ItemType* const pNextItem = pFrontItem->pNext; - if (pNextItem != 
VMA_NULL) - { - pNextItem->pPrev = VMA_NULL; - } - m_pFront = pNextItem; - m_ItemAllocator.Free(pFrontItem); - --m_Count; -} - -template -void VmaRawList::PopBack() -{ - VMA_HEAVY_ASSERT(m_Count > 0); - ItemType* const pBackItem = m_pBack; - ItemType* const pPrevItem = pBackItem->pPrev; - if(pPrevItem != VMA_NULL) - { - pPrevItem->pNext = VMA_NULL; - } - m_pBack = pPrevItem; - m_ItemAllocator.Free(pBackItem); - --m_Count; -} - -template -void VmaRawList::Clear() -{ - if (IsEmpty() == false) - { - ItemType* pItem = m_pBack; - while (pItem != VMA_NULL) - { - ItemType* const pPrevItem = pItem->pPrev; - m_ItemAllocator.Free(pItem); - pItem = pPrevItem; - } - m_pFront = VMA_NULL; - m_pBack = VMA_NULL; - m_Count = 0; - } -} - -template -void VmaRawList::Remove(ItemType* pItem) -{ - VMA_HEAVY_ASSERT(pItem != VMA_NULL); - VMA_HEAVY_ASSERT(m_Count > 0); - - if(pItem->pPrev != VMA_NULL) - { - pItem->pPrev->pNext = pItem->pNext; - } - else - { - VMA_HEAVY_ASSERT(m_pFront == pItem); - m_pFront = pItem->pNext; - } - - if(pItem->pNext != VMA_NULL) - { - pItem->pNext->pPrev = pItem->pPrev; - } - else - { - VMA_HEAVY_ASSERT(m_pBack == pItem); - m_pBack = pItem->pPrev; - } - - m_ItemAllocator.Free(pItem); - --m_Count; -} - -template -VmaListItem* VmaRawList::InsertBefore(ItemType* pItem) -{ - if(pItem != VMA_NULL) - { - ItemType* const prevItem = pItem->pPrev; - ItemType* const newItem = m_ItemAllocator.Alloc(); - newItem->pPrev = prevItem; - newItem->pNext = pItem; - pItem->pPrev = newItem; - if(prevItem != VMA_NULL) - { - prevItem->pNext = newItem; - } - else - { - VMA_HEAVY_ASSERT(m_pFront == pItem); - m_pFront = newItem; - } - ++m_Count; - return newItem; - } - else - return PushBack(); -} - -template -VmaListItem* VmaRawList::InsertAfter(ItemType* pItem) -{ - if(pItem != VMA_NULL) - { - ItemType* const nextItem = pItem->pNext; - ItemType* const newItem = m_ItemAllocator.Alloc(); - newItem->pNext = nextItem; - newItem->pPrev = pItem; - pItem->pNext = newItem; - if(nextItem != VMA_NULL) - { - nextItem->pPrev = newItem; - } - else - { - VMA_HEAVY_ASSERT(m_pBack == pItem); - m_pBack = newItem; - } - ++m_Count; - return newItem; - } - else - return PushFront(); -} - -template -VmaListItem* VmaRawList::InsertBefore(ItemType* pItem, const T& value) -{ - ItemType* const newItem = InsertBefore(pItem); - newItem->Value = value; - return newItem; -} - -template -VmaListItem* VmaRawList::InsertAfter(ItemType* pItem, const T& value) -{ - ItemType* const newItem = InsertAfter(pItem); - newItem->Value = value; - return newItem; -} -#endif // _VMA_RAW_LIST_FUNCTIONS -#endif // _VMA_RAW_LIST - -#ifndef _VMA_LIST -template -class VmaList -{ - VMA_CLASS_NO_COPY(VmaList) -public: - class reverse_iterator; - class const_iterator; - class const_reverse_iterator; - - class iterator - { - friend class const_iterator; - friend class VmaList; - public: - iterator() : m_pList(VMA_NULL), m_pItem(VMA_NULL) {} - iterator(const reverse_iterator& src) : m_pList(src.m_pList), m_pItem(src.m_pItem) {} - - T& operator*() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return m_pItem->Value; } - T* operator->() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return &m_pItem->Value; } - - bool operator==(const iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem == rhs.m_pItem; } - bool operator!=(const iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem != rhs.m_pItem; } - - iterator operator++(int) { iterator result = *this; ++*this; return result; } - iterator operator--(int) { 
iterator result = *this; --*this; return result; } - - iterator& operator++() { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); m_pItem = m_pItem->pNext; return *this; } - iterator& operator--(); - - private: - VmaRawList* m_pList; - VmaListItem* m_pItem; - - iterator(VmaRawList* pList, VmaListItem* pItem) : m_pList(pList), m_pItem(pItem) {} - }; - class reverse_iterator - { - friend class const_reverse_iterator; - friend class VmaList; - public: - reverse_iterator() : m_pList(VMA_NULL), m_pItem(VMA_NULL) {} - reverse_iterator(const iterator& src) : m_pList(src.m_pList), m_pItem(src.m_pItem) {} - - T& operator*() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return m_pItem->Value; } - T* operator->() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return &m_pItem->Value; } - - bool operator==(const reverse_iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem == rhs.m_pItem; } - bool operator!=(const reverse_iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem != rhs.m_pItem; } - - reverse_iterator operator++(int) { reverse_iterator result = *this; ++* this; return result; } - reverse_iterator operator--(int) { reverse_iterator result = *this; --* this; return result; } - - reverse_iterator& operator++() { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); m_pItem = m_pItem->pPrev; return *this; } - reverse_iterator& operator--(); - - private: - VmaRawList* m_pList; - VmaListItem* m_pItem; - - reverse_iterator(VmaRawList* pList, VmaListItem* pItem) : m_pList(pList), m_pItem(pItem) {} - }; - class const_iterator - { - friend class VmaList; - public: - const_iterator() : m_pList(VMA_NULL), m_pItem(VMA_NULL) {} - const_iterator(const iterator& src) : m_pList(src.m_pList), m_pItem(src.m_pItem) {} - const_iterator(const reverse_iterator& src) : m_pList(src.m_pList), m_pItem(src.m_pItem) {} - - iterator drop_const() { return { const_cast*>(m_pList), const_cast*>(m_pItem) }; } - - const T& operator*() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return m_pItem->Value; } - const T* operator->() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return &m_pItem->Value; } - - bool operator==(const const_iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem == rhs.m_pItem; } - bool operator!=(const const_iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem != rhs.m_pItem; } - - const_iterator operator++(int) { const_iterator result = *this; ++* this; return result; } - const_iterator operator--(int) { const_iterator result = *this; --* this; return result; } - - const_iterator& operator++() { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); m_pItem = m_pItem->pNext; return *this; } - const_iterator& operator--(); - - private: - const VmaRawList* m_pList; - const VmaListItem* m_pItem; - - const_iterator(const VmaRawList* pList, const VmaListItem* pItem) : m_pList(pList), m_pItem(pItem) {} - }; - class const_reverse_iterator - { - friend class VmaList; - public: - const_reverse_iterator() : m_pList(VMA_NULL), m_pItem(VMA_NULL) {} - const_reverse_iterator(const reverse_iterator& src) : m_pList(src.m_pList), m_pItem(src.m_pItem) {} - const_reverse_iterator(const iterator& src) : m_pList(src.m_pList), m_pItem(src.m_pItem) {} - - reverse_iterator drop_const() { return { const_cast*>(m_pList), const_cast*>(m_pItem) }; } - - const T& operator*() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return m_pItem->Value; } - const T* operator->() const { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); return &m_pItem->Value; } - - bool operator==(const 
const_reverse_iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem == rhs.m_pItem; } - bool operator!=(const const_reverse_iterator& rhs) const { VMA_HEAVY_ASSERT(m_pList == rhs.m_pList); return m_pItem != rhs.m_pItem; } - - const_reverse_iterator operator++(int) { const_reverse_iterator result = *this; ++* this; return result; } - const_reverse_iterator operator--(int) { const_reverse_iterator result = *this; --* this; return result; } - - const_reverse_iterator& operator++() { VMA_HEAVY_ASSERT(m_pItem != VMA_NULL); m_pItem = m_pItem->pPrev; return *this; } - const_reverse_iterator& operator--(); - - private: - const VmaRawList* m_pList; - const VmaListItem* m_pItem; - - const_reverse_iterator(const VmaRawList* pList, const VmaListItem* pItem) : m_pList(pList), m_pItem(pItem) {} - }; - - VmaList(const AllocatorT& allocator) : m_RawList(allocator.m_pCallbacks) {} - - bool empty() const { return m_RawList.IsEmpty(); } - size_t size() const { return m_RawList.GetCount(); } - - iterator begin() { return iterator(&m_RawList, m_RawList.Front()); } - iterator end() { return iterator(&m_RawList, VMA_NULL); } - - const_iterator cbegin() const { return const_iterator(&m_RawList, m_RawList.Front()); } - const_iterator cend() const { return const_iterator(&m_RawList, VMA_NULL); } - - const_iterator begin() const { return cbegin(); } - const_iterator end() const { return cend(); } - - reverse_iterator rbegin() { return reverse_iterator(&m_RawList, m_RawList.Back()); } - reverse_iterator rend() { return reverse_iterator(&m_RawList, VMA_NULL); } - - const_reverse_iterator crbegin() const { return const_reverse_iterator(&m_RawList, m_RawList.Back()); } - const_reverse_iterator crend() const { return const_reverse_iterator(&m_RawList, VMA_NULL); } - - const_reverse_iterator rbegin() const { return crbegin(); } - const_reverse_iterator rend() const { return crend(); } - - void push_back(const T& value) { m_RawList.PushBack(value); } - iterator insert(iterator it, const T& value) { return iterator(&m_RawList, m_RawList.InsertBefore(it.m_pItem, value)); } - - void clear() { m_RawList.Clear(); } - void erase(iterator it) { m_RawList.Remove(it.m_pItem); } - -private: - VmaRawList m_RawList; -}; - -#ifndef _VMA_LIST_FUNCTIONS -template -typename VmaList::iterator& VmaList::iterator::operator--() -{ - if (m_pItem != VMA_NULL) - { - m_pItem = m_pItem->pPrev; - } - else - { - VMA_HEAVY_ASSERT(!m_pList->IsEmpty()); - m_pItem = m_pList->Back(); - } - return *this; -} - -template -typename VmaList::reverse_iterator& VmaList::reverse_iterator::operator--() -{ - if (m_pItem != VMA_NULL) - { - m_pItem = m_pItem->pNext; - } - else - { - VMA_HEAVY_ASSERT(!m_pList->IsEmpty()); - m_pItem = m_pList->Front(); - } - return *this; -} - -template -typename VmaList::const_iterator& VmaList::const_iterator::operator--() -{ - if (m_pItem != VMA_NULL) - { - m_pItem = m_pItem->pPrev; - } - else - { - VMA_HEAVY_ASSERT(!m_pList->IsEmpty()); - m_pItem = m_pList->Back(); - } - return *this; -} - -template -typename VmaList::const_reverse_iterator& VmaList::const_reverse_iterator::operator--() -{ - if (m_pItem != VMA_NULL) - { - m_pItem = m_pItem->pNext; - } - else - { - VMA_HEAVY_ASSERT(!m_pList->IsEmpty()); - m_pItem = m_pList->Back(); - } - return *this; -} -#endif // _VMA_LIST_FUNCTIONS -#endif // _VMA_LIST - -#ifndef _VMA_INTRUSIVE_LINKED_LIST -/* -Expected interface of ItemTypeTraits: -struct MyItemTypeTraits -{ - typedef MyItem ItemType; - static ItemType* GetPrev(const ItemType* item) { return 
item->myPrevPtr; } - static ItemType* GetNext(const ItemType* item) { return item->myNextPtr; } - static ItemType*& AccessPrev(ItemType* item) { return item->myPrevPtr; } - static ItemType*& AccessNext(ItemType* item) { return item->myNextPtr; } -}; -*/ -template -class VmaIntrusiveLinkedList -{ -public: - typedef typename ItemTypeTraits::ItemType ItemType; - static ItemType* GetPrev(const ItemType* item) { return ItemTypeTraits::GetPrev(item); } - static ItemType* GetNext(const ItemType* item) { return ItemTypeTraits::GetNext(item); } - - // Movable, not copyable. - VmaIntrusiveLinkedList() = default; - VmaIntrusiveLinkedList(VmaIntrusiveLinkedList && src); - VmaIntrusiveLinkedList(const VmaIntrusiveLinkedList&) = delete; - VmaIntrusiveLinkedList& operator=(VmaIntrusiveLinkedList&& src); - VmaIntrusiveLinkedList& operator=(const VmaIntrusiveLinkedList&) = delete; - ~VmaIntrusiveLinkedList() { VMA_HEAVY_ASSERT(IsEmpty()); } - - size_t GetCount() const { return m_Count; } - bool IsEmpty() const { return m_Count == 0; } - ItemType* Front() { return m_Front; } - ItemType* Back() { return m_Back; } - const ItemType* Front() const { return m_Front; } - const ItemType* Back() const { return m_Back; } - - void PushBack(ItemType* item); - void PushFront(ItemType* item); - ItemType* PopBack(); - ItemType* PopFront(); - - // MyItem can be null - it means PushBack. - void InsertBefore(ItemType* existingItem, ItemType* newItem); - // MyItem can be null - it means PushFront. - void InsertAfter(ItemType* existingItem, ItemType* newItem); - void Remove(ItemType* item); - void RemoveAll(); - -private: - ItemType* m_Front = VMA_NULL; - ItemType* m_Back = VMA_NULL; - size_t m_Count = 0; -}; - -#ifndef _VMA_INTRUSIVE_LINKED_LIST_FUNCTIONS -template -VmaIntrusiveLinkedList::VmaIntrusiveLinkedList(VmaIntrusiveLinkedList&& src) - : m_Front(src.m_Front), m_Back(src.m_Back), m_Count(src.m_Count) -{ - src.m_Front = src.m_Back = VMA_NULL; - src.m_Count = 0; -} - -template -VmaIntrusiveLinkedList& VmaIntrusiveLinkedList::operator=(VmaIntrusiveLinkedList&& src) -{ - if (&src != this) - { - VMA_HEAVY_ASSERT(IsEmpty()); - m_Front = src.m_Front; - m_Back = src.m_Back; - m_Count = src.m_Count; - src.m_Front = src.m_Back = VMA_NULL; - src.m_Count = 0; - } - return *this; -} - -template -void VmaIntrusiveLinkedList::PushBack(ItemType* item) -{ - VMA_HEAVY_ASSERT(ItemTypeTraits::GetPrev(item) == VMA_NULL && ItemTypeTraits::GetNext(item) == VMA_NULL); - if (IsEmpty()) - { - m_Front = item; - m_Back = item; - m_Count = 1; - } - else - { - ItemTypeTraits::AccessPrev(item) = m_Back; - ItemTypeTraits::AccessNext(m_Back) = item; - m_Back = item; - ++m_Count; - } -} - -template -void VmaIntrusiveLinkedList::PushFront(ItemType* item) -{ - VMA_HEAVY_ASSERT(ItemTypeTraits::GetPrev(item) == VMA_NULL && ItemTypeTraits::GetNext(item) == VMA_NULL); - if (IsEmpty()) - { - m_Front = item; - m_Back = item; - m_Count = 1; - } - else - { - ItemTypeTraits::AccessNext(item) = m_Front; - ItemTypeTraits::AccessPrev(m_Front) = item; - m_Front = item; - ++m_Count; - } -} - -template -typename VmaIntrusiveLinkedList::ItemType* VmaIntrusiveLinkedList::PopBack() -{ - VMA_HEAVY_ASSERT(m_Count > 0); - ItemType* const backItem = m_Back; - ItemType* const prevItem = ItemTypeTraits::GetPrev(backItem); - if (prevItem != VMA_NULL) - { - ItemTypeTraits::AccessNext(prevItem) = VMA_NULL; - } - m_Back = prevItem; - --m_Count; - ItemTypeTraits::AccessPrev(backItem) = VMA_NULL; - ItemTypeTraits::AccessNext(backItem) = VMA_NULL; - return backItem; -} - -template 
-typename VmaIntrusiveLinkedList::ItemType* VmaIntrusiveLinkedList::PopFront() -{ - VMA_HEAVY_ASSERT(m_Count > 0); - ItemType* const frontItem = m_Front; - ItemType* const nextItem = ItemTypeTraits::GetNext(frontItem); - if (nextItem != VMA_NULL) - { - ItemTypeTraits::AccessPrev(nextItem) = VMA_NULL; - } - m_Front = nextItem; - --m_Count; - ItemTypeTraits::AccessPrev(frontItem) = VMA_NULL; - ItemTypeTraits::AccessNext(frontItem) = VMA_NULL; - return frontItem; -} - -template -void VmaIntrusiveLinkedList::InsertBefore(ItemType* existingItem, ItemType* newItem) -{ - VMA_HEAVY_ASSERT(newItem != VMA_NULL && ItemTypeTraits::GetPrev(newItem) == VMA_NULL && ItemTypeTraits::GetNext(newItem) == VMA_NULL); - if (existingItem != VMA_NULL) - { - ItemType* const prevItem = ItemTypeTraits::GetPrev(existingItem); - ItemTypeTraits::AccessPrev(newItem) = prevItem; - ItemTypeTraits::AccessNext(newItem) = existingItem; - ItemTypeTraits::AccessPrev(existingItem) = newItem; - if (prevItem != VMA_NULL) - { - ItemTypeTraits::AccessNext(prevItem) = newItem; - } - else - { - VMA_HEAVY_ASSERT(m_Front == existingItem); - m_Front = newItem; - } - ++m_Count; - } - else - PushBack(newItem); -} - -template -void VmaIntrusiveLinkedList::InsertAfter(ItemType* existingItem, ItemType* newItem) -{ - VMA_HEAVY_ASSERT(newItem != VMA_NULL && ItemTypeTraits::GetPrev(newItem) == VMA_NULL && ItemTypeTraits::GetNext(newItem) == VMA_NULL); - if (existingItem != VMA_NULL) - { - ItemType* const nextItem = ItemTypeTraits::GetNext(existingItem); - ItemTypeTraits::AccessNext(newItem) = nextItem; - ItemTypeTraits::AccessPrev(newItem) = existingItem; - ItemTypeTraits::AccessNext(existingItem) = newItem; - if (nextItem != VMA_NULL) - { - ItemTypeTraits::AccessPrev(nextItem) = newItem; - } - else - { - VMA_HEAVY_ASSERT(m_Back == existingItem); - m_Back = newItem; - } - ++m_Count; - } - else - return PushFront(newItem); -} - -template -void VmaIntrusiveLinkedList::Remove(ItemType* item) -{ - VMA_HEAVY_ASSERT(item != VMA_NULL && m_Count > 0); - if (ItemTypeTraits::GetPrev(item) != VMA_NULL) - { - ItemTypeTraits::AccessNext(ItemTypeTraits::AccessPrev(item)) = ItemTypeTraits::GetNext(item); - } - else - { - VMA_HEAVY_ASSERT(m_Front == item); - m_Front = ItemTypeTraits::GetNext(item); - } - - if (ItemTypeTraits::GetNext(item) != VMA_NULL) - { - ItemTypeTraits::AccessPrev(ItemTypeTraits::AccessNext(item)) = ItemTypeTraits::GetPrev(item); - } - else - { - VMA_HEAVY_ASSERT(m_Back == item); - m_Back = ItemTypeTraits::GetPrev(item); - } - ItemTypeTraits::AccessPrev(item) = VMA_NULL; - ItemTypeTraits::AccessNext(item) = VMA_NULL; - --m_Count; -} - -template -void VmaIntrusiveLinkedList::RemoveAll() -{ - if (!IsEmpty()) - { - ItemType* item = m_Back; - while (item != VMA_NULL) - { - ItemType* const prevItem = ItemTypeTraits::AccessPrev(item); - ItemTypeTraits::AccessPrev(item) = VMA_NULL; - ItemTypeTraits::AccessNext(item) = VMA_NULL; - item = prevItem; - } - m_Front = VMA_NULL; - m_Back = VMA_NULL; - m_Count = 0; - } -} -#endif // _VMA_INTRUSIVE_LINKED_LIST_FUNCTIONS -#endif // _VMA_INTRUSIVE_LINKED_LIST - -// Unused in this version. 
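The "Expected interface of ItemTypeTraits" comment above spells out what VmaIntrusiveLinkedList needs from its traits parameter. A hedged sketch of a node type satisfying that contract follows; MyItem, its field names, and the usage comments are hypothetical and only mirror the documented interface.

```
// Hypothetical node type with embedded prev/next pointers (names are illustrative).
struct MyItem
{
    MyItem* myPrevPtr = nullptr;
    MyItem* myNextPtr = nullptr;
    int payload = 0;
};

// Traits type matching the interface documented above.
struct MyItemTypeTraits
{
    typedef MyItem ItemType;
    static ItemType* GetPrev(const ItemType* item) { return item->myPrevPtr; }
    static ItemType* GetNext(const ItemType* item) { return item->myNextPtr; }
    static ItemType*& AccessPrev(ItemType* item) { return item->myPrevPtr; }
    static ItemType*& AccessNext(ItemType* item) { return item->myNextPtr; }
};

// Usage sketch: the list only links nodes, it never owns or allocates them.
// VmaIntrusiveLinkedList<MyItemTypeTraits> list;
// MyItem a, b;
// list.PushBack(&a);
// list.PushBack(&b);
// list.Remove(&a);
// list.RemoveAll();   // the destructor asserts the list is empty
```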
-#if 0 - -#ifndef _VMA_PAIR -template -struct VmaPair -{ - T1 first; - T2 second; - - VmaPair() : first(), second() {} - VmaPair(const T1& firstSrc, const T2& secondSrc) : first(firstSrc), second(secondSrc) {} -}; - -template -struct VmaPairFirstLess -{ - bool operator()(const VmaPair& lhs, const VmaPair& rhs) const - { - return lhs.first < rhs.first; - } - bool operator()(const VmaPair& lhs, const FirstT& rhsFirst) const - { - return lhs.first < rhsFirst; - } -}; -#endif // _VMA_PAIR - -#ifndef _VMA_MAP -/* Class compatible with subset of interface of std::unordered_map. -KeyT, ValueT must be POD because they will be stored in VmaVector. -*/ -template -class VmaMap -{ -public: - typedef VmaPair PairType; - typedef PairType* iterator; - - VmaMap(const VmaStlAllocator& allocator) : m_Vector(allocator) {} - - iterator begin() { return m_Vector.begin(); } - iterator end() { return m_Vector.end(); } - size_t size() { return m_Vector.size(); } - - void insert(const PairType& pair); - iterator find(const KeyT& key); - void erase(iterator it); - -private: - VmaVector< PairType, VmaStlAllocator> m_Vector; -}; - -#ifndef _VMA_MAP_FUNCTIONS -template -void VmaMap::insert(const PairType& pair) -{ - const size_t indexToInsert = VmaBinaryFindFirstNotLess( - m_Vector.data(), - m_Vector.data() + m_Vector.size(), - pair, - VmaPairFirstLess()) - m_Vector.data(); - VmaVectorInsert(m_Vector, indexToInsert, pair); -} - -template -VmaPair* VmaMap::find(const KeyT& key) -{ - PairType* it = VmaBinaryFindFirstNotLess( - m_Vector.data(), - m_Vector.data() + m_Vector.size(), - key, - VmaPairFirstLess()); - if ((it != m_Vector.end()) && (it->first == key)) - { - return it; - } - else - { - return m_Vector.end(); - } -} - -template -void VmaMap::erase(iterator it) -{ - VmaVectorRemove(m_Vector, it - m_Vector.begin()); -} -#endif // _VMA_MAP_FUNCTIONS -#endif // _VMA_MAP - -#endif // #if 0 - -#if !defined(_VMA_STRING_BUILDER) && VMA_STATS_STRING_ENABLED -class VmaStringBuilder -{ -public: - VmaStringBuilder(const VkAllocationCallbacks* allocationCallbacks) : m_Data(VmaStlAllocator(allocationCallbacks)) {} - ~VmaStringBuilder() = default; - - size_t GetLength() const { return m_Data.size(); } - const char* GetData() const { return m_Data.data(); } - void AddNewLine() { Add('\n'); } - void Add(char ch) { m_Data.push_back(ch); } - - void Add(const char* pStr); - void AddNumber(uint32_t num); - void AddNumber(uint64_t num); - void AddPointer(const void* ptr); - -private: - VmaVector> m_Data; -}; - -#ifndef _VMA_STRING_BUILDER_FUNCTIONS -void VmaStringBuilder::Add(const char* pStr) -{ - const size_t strLen = strlen(pStr); - if (strLen > 0) - { - const size_t oldCount = m_Data.size(); - m_Data.resize(oldCount + strLen); - memcpy(m_Data.data() + oldCount, pStr, strLen); - } -} - -void VmaStringBuilder::AddNumber(uint32_t num) -{ - char buf[11]; - buf[10] = '\0'; - char* p = &buf[10]; - do - { - *--p = '0' + (num % 10); - num /= 10; - } while (num); - Add(p); -} - -void VmaStringBuilder::AddNumber(uint64_t num) -{ - char buf[21]; - buf[20] = '\0'; - char* p = &buf[20]; - do - { - *--p = '0' + (num % 10); - num /= 10; - } while (num); - Add(p); -} - -void VmaStringBuilder::AddPointer(const void* ptr) -{ - char buf[21]; - VmaPtrToStr(buf, sizeof(buf), ptr); - Add(buf); -} -#endif //_VMA_STRING_BUILDER_FUNCTIONS -#endif // _VMA_STRING_BUILDER - -#if !defined(_VMA_JSON_WRITER) && VMA_STATS_STRING_ENABLED -/* -Allows to conveniently build a correct JSON document to be written to the -VmaStringBuilder passed to the constructor. 
-*/ -class VmaJsonWriter -{ - VMA_CLASS_NO_COPY(VmaJsonWriter) -public: - // sb - string builder to write the document to. Must remain alive for the whole lifetime of this object. - VmaJsonWriter(const VkAllocationCallbacks* pAllocationCallbacks, VmaStringBuilder& sb); - ~VmaJsonWriter(); - - // Begins object by writing "{". - // Inside an object, you must call pairs of WriteString and a value, e.g.: - // j.BeginObject(true); j.WriteString("A"); j.WriteNumber(1); j.WriteString("B"); j.WriteNumber(2); j.EndObject(); - // Will write: { "A": 1, "B": 2 } - void BeginObject(bool singleLine = false); - // Ends object by writing "}". - void EndObject(); - - // Begins array by writing "[". - // Inside an array, you can write a sequence of any values. - void BeginArray(bool singleLine = false); - // Ends array by writing "[". - void EndArray(); - - // Writes a string value inside "". - // pStr can contain any ANSI characters, including '"', new line etc. - they will be properly escaped. - void WriteString(const char* pStr); - - // Begins writing a string value. - // Call BeginString, ContinueString, ContinueString, ..., EndString instead of - // WriteString to conveniently build the string content incrementally, made of - // parts including numbers. - void BeginString(const char* pStr = VMA_NULL); - // Posts next part of an open string. - void ContinueString(const char* pStr); - // Posts next part of an open string. The number is converted to decimal characters. - void ContinueString(uint32_t n); - void ContinueString(uint64_t n); - void ContinueString_Size(size_t n); - // Posts next part of an open string. Pointer value is converted to characters - // using "%p" formatting - shown as hexadecimal number, e.g.: 000000081276Ad00 - void ContinueString_Pointer(const void* ptr); - // Ends writing a string value by writing '"'. - void EndString(const char* pStr = VMA_NULL); - - // Writes a number value. - void WriteNumber(uint32_t n); - void WriteNumber(uint64_t n); - void WriteSize(size_t n); - // Writes a boolean value - false or true. - void WriteBool(bool b); - // Writes a null value. 
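The BeginObject comment above already gives the intended call pattern for this writer. A slightly fuller usage sketch follows, assuming a VMA_STATS_STRING_ENABLED build; WriteExampleJson and the field names it writes are hypothetical, while the VmaStringBuilder/VmaJsonWriter calls match the declarations in this header.

```
// Hedged usage sketch for the writer declared above (VMA_STATS_STRING_ENABLED builds only).
void WriteExampleJson(const VkAllocationCallbacks* pCallbacks)
{
    VmaStringBuilder sb(pCallbacks);
    VmaJsonWriter json(pCallbacks, sb);

    json.BeginObject();                 // {
    json.WriteString("Name");           //   "Name":
    json.WriteString("ExamplePool");    //     "ExamplePool",
    json.WriteString("BlockSizes");     //   "BlockSizes":
    json.BeginArray(true);              //     [
    json.WriteNumber(uint32_t(64));     //       64,
    json.WriteNumber(uint32_t(128));    //       128
    json.EndArray();                    //     ],
    json.WriteString("Empty");          //   "Empty":
    json.WriteBool(true);               //     true
    json.EndObject();                   // }

    // sb.GetData() / sb.GetLength() now hold the finished document text.
}
```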
- void WriteNull(); - -private: - enum COLLECTION_TYPE - { - COLLECTION_TYPE_OBJECT, - COLLECTION_TYPE_ARRAY, - }; - struct StackItem - { - COLLECTION_TYPE type; - uint32_t valueCount; - bool singleLineMode; - }; - - static const char* const INDENT; - - VmaStringBuilder& m_SB; - VmaVector< StackItem, VmaStlAllocator > m_Stack; - bool m_InsideString; - - // Write size_t for less than 64bits - void WriteSize(size_t n, std::integral_constant) { m_SB.AddNumber(static_cast(n)); } - // Write size_t for 64bits - void WriteSize(size_t n, std::integral_constant) { m_SB.AddNumber(static_cast(n)); } - - void BeginValue(bool isString); - void WriteIndent(bool oneLess = false); -}; -const char* const VmaJsonWriter::INDENT = " "; - -#ifndef _VMA_JSON_WRITER_FUNCTIONS -VmaJsonWriter::VmaJsonWriter(const VkAllocationCallbacks* pAllocationCallbacks, VmaStringBuilder& sb) - : m_SB(sb), - m_Stack(VmaStlAllocator(pAllocationCallbacks)), - m_InsideString(false) {} - -VmaJsonWriter::~VmaJsonWriter() -{ - VMA_ASSERT(!m_InsideString); - VMA_ASSERT(m_Stack.empty()); -} - -void VmaJsonWriter::BeginObject(bool singleLine) -{ - VMA_ASSERT(!m_InsideString); - - BeginValue(false); - m_SB.Add('{'); - - StackItem item; - item.type = COLLECTION_TYPE_OBJECT; - item.valueCount = 0; - item.singleLineMode = singleLine; - m_Stack.push_back(item); -} - -void VmaJsonWriter::EndObject() -{ - VMA_ASSERT(!m_InsideString); - - WriteIndent(true); - m_SB.Add('}'); - - VMA_ASSERT(!m_Stack.empty() && m_Stack.back().type == COLLECTION_TYPE_OBJECT); - m_Stack.pop_back(); -} - -void VmaJsonWriter::BeginArray(bool singleLine) -{ - VMA_ASSERT(!m_InsideString); - - BeginValue(false); - m_SB.Add('['); - - StackItem item; - item.type = COLLECTION_TYPE_ARRAY; - item.valueCount = 0; - item.singleLineMode = singleLine; - m_Stack.push_back(item); -} - -void VmaJsonWriter::EndArray() -{ - VMA_ASSERT(!m_InsideString); - - WriteIndent(true); - m_SB.Add(']'); - - VMA_ASSERT(!m_Stack.empty() && m_Stack.back().type == COLLECTION_TYPE_ARRAY); - m_Stack.pop_back(); -} - -void VmaJsonWriter::WriteString(const char* pStr) -{ - BeginString(pStr); - EndString(); -} - -void VmaJsonWriter::BeginString(const char* pStr) -{ - VMA_ASSERT(!m_InsideString); - - BeginValue(true); - m_SB.Add('"'); - m_InsideString = true; - if (pStr != VMA_NULL && pStr[0] != '\0') - { - ContinueString(pStr); - } -} - -void VmaJsonWriter::ContinueString(const char* pStr) -{ - VMA_ASSERT(m_InsideString); - - const size_t strLen = strlen(pStr); - for (size_t i = 0; i < strLen; ++i) - { - char ch = pStr[i]; - if (ch == '\\') - { - m_SB.Add("\\\\"); - } - else if (ch == '"') - { - m_SB.Add("\\\""); - } - else if (ch >= 32) - { - m_SB.Add(ch); - } - else switch (ch) - { - case '\b': - m_SB.Add("\\b"); - break; - case '\f': - m_SB.Add("\\f"); - break; - case '\n': - m_SB.Add("\\n"); - break; - case '\r': - m_SB.Add("\\r"); - break; - case '\t': - m_SB.Add("\\t"); - break; - default: - VMA_ASSERT(0 && "Character not currently supported."); - break; - } - } -} - -void VmaJsonWriter::ContinueString(uint32_t n) -{ - VMA_ASSERT(m_InsideString); - m_SB.AddNumber(n); -} - -void VmaJsonWriter::ContinueString(uint64_t n) -{ - VMA_ASSERT(m_InsideString); - m_SB.AddNumber(n); -} - -void VmaJsonWriter::ContinueString_Size(size_t n) -{ - VMA_ASSERT(m_InsideString); - // Fix for AppleClang incorrect type casting - // TODO: Change to if constexpr when C++17 used as minimal standard - WriteSize(n, std::is_same{}); -} - -void VmaJsonWriter::ContinueString_Pointer(const void* ptr) -{ - 
VMA_ASSERT(m_InsideString); - m_SB.AddPointer(ptr); -} - -void VmaJsonWriter::EndString(const char* pStr) -{ - VMA_ASSERT(m_InsideString); - if (pStr != VMA_NULL && pStr[0] != '\0') - { - ContinueString(pStr); - } - m_SB.Add('"'); - m_InsideString = false; -} - -void VmaJsonWriter::WriteNumber(uint32_t n) -{ - VMA_ASSERT(!m_InsideString); - BeginValue(false); - m_SB.AddNumber(n); -} - -void VmaJsonWriter::WriteNumber(uint64_t n) -{ - VMA_ASSERT(!m_InsideString); - BeginValue(false); - m_SB.AddNumber(n); -} - -void VmaJsonWriter::WriteSize(size_t n) -{ - VMA_ASSERT(!m_InsideString); - BeginValue(false); - // Fix for AppleClang incorrect type casting - // TODO: Change to if constexpr when C++17 used as minimal standard - WriteSize(n, std::is_same{}); -} - -void VmaJsonWriter::WriteBool(bool b) -{ - VMA_ASSERT(!m_InsideString); - BeginValue(false); - m_SB.Add(b ? "true" : "false"); -} - -void VmaJsonWriter::WriteNull() -{ - VMA_ASSERT(!m_InsideString); - BeginValue(false); - m_SB.Add("null"); -} - -void VmaJsonWriter::BeginValue(bool isString) -{ - if (!m_Stack.empty()) - { - StackItem& currItem = m_Stack.back(); - if (currItem.type == COLLECTION_TYPE_OBJECT && - currItem.valueCount % 2 == 0) - { - VMA_ASSERT(isString); - } - - if (currItem.type == COLLECTION_TYPE_OBJECT && - currItem.valueCount % 2 != 0) - { - m_SB.Add(": "); - } - else if (currItem.valueCount > 0) - { - m_SB.Add(", "); - WriteIndent(); - } - else - { - WriteIndent(); - } - ++currItem.valueCount; - } -} - -void VmaJsonWriter::WriteIndent(bool oneLess) -{ - if (!m_Stack.empty() && !m_Stack.back().singleLineMode) - { - m_SB.AddNewLine(); - - size_t count = m_Stack.size(); - if (count > 0 && oneLess) - { - --count; - } - for (size_t i = 0; i < count; ++i) - { - m_SB.Add(INDENT); - } - } -} -#endif // _VMA_JSON_WRITER_FUNCTIONS - -static void VmaPrintDetailedStatistics(VmaJsonWriter& json, const VmaDetailedStatistics& stat) -{ - json.BeginObject(); - - json.WriteString("BlockCount"); - json.WriteNumber(stat.statistics.blockCount); - json.WriteString("BlockBytes"); - json.WriteNumber(stat.statistics.blockBytes); - json.WriteString("AllocationCount"); - json.WriteNumber(stat.statistics.allocationCount); - json.WriteString("AllocationBytes"); - json.WriteNumber(stat.statistics.allocationBytes); - json.WriteString("UnusedRangeCount"); - json.WriteNumber(stat.unusedRangeCount); - - if (stat.statistics.allocationCount > 1) - { - json.WriteString("AllocationSizeMin"); - json.WriteNumber(stat.allocationSizeMin); - json.WriteString("AllocationSizeMax"); - json.WriteNumber(stat.allocationSizeMax); - } - if (stat.unusedRangeCount > 1) - { - json.WriteString("UnusedRangeSizeMin"); - json.WriteNumber(stat.unusedRangeSizeMin); - json.WriteString("UnusedRangeSizeMax"); - json.WriteNumber(stat.unusedRangeSizeMax); - } - json.EndObject(); -} -#endif // _VMA_JSON_WRITER - -#ifndef _VMA_MAPPING_HYSTERESIS - -class VmaMappingHysteresis -{ - VMA_CLASS_NO_COPY(VmaMappingHysteresis) -public: - VmaMappingHysteresis() = default; - - uint32_t GetExtraMapping() const { return m_ExtraMapping; } - - // Call when Map was called. - // Returns true if switched to extra +1 mapping reference count. 
- bool PostMap() - { -#if VMA_MAPPING_HYSTERESIS_ENABLED - if(m_ExtraMapping == 0) - { - ++m_MajorCounter; - if(m_MajorCounter >= COUNTER_MIN_EXTRA_MAPPING) - { - m_ExtraMapping = 1; - m_MajorCounter = 0; - m_MinorCounter = 0; - return true; - } - } - else // m_ExtraMapping == 1 - PostMinorCounter(); -#endif // #if VMA_MAPPING_HYSTERESIS_ENABLED - return false; - } - - // Call when Unmap was called. - void PostUnmap() - { -#if VMA_MAPPING_HYSTERESIS_ENABLED - if(m_ExtraMapping == 0) - ++m_MajorCounter; - else // m_ExtraMapping == 1 - PostMinorCounter(); -#endif // #if VMA_MAPPING_HYSTERESIS_ENABLED - } - - // Call when allocation was made from the memory block. - void PostAlloc() - { -#if VMA_MAPPING_HYSTERESIS_ENABLED - if(m_ExtraMapping == 1) - ++m_MajorCounter; - else // m_ExtraMapping == 0 - PostMinorCounter(); -#endif // #if VMA_MAPPING_HYSTERESIS_ENABLED - } - - // Call when allocation was freed from the memory block. - // Returns true if switched to extra -1 mapping reference count. - bool PostFree() - { -#if VMA_MAPPING_HYSTERESIS_ENABLED - if(m_ExtraMapping == 1) - { - ++m_MajorCounter; - if(m_MajorCounter >= COUNTER_MIN_EXTRA_MAPPING && - m_MajorCounter > m_MinorCounter + 1) - { - m_ExtraMapping = 0; - m_MajorCounter = 0; - m_MinorCounter = 0; - return true; - } - } - else // m_ExtraMapping == 0 - PostMinorCounter(); -#endif // #if VMA_MAPPING_HYSTERESIS_ENABLED - return false; - } - -private: - static const int32_t COUNTER_MIN_EXTRA_MAPPING = 7; - - uint32_t m_MinorCounter = 0; - uint32_t m_MajorCounter = 0; - uint32_t m_ExtraMapping = 0; // 0 or 1. - - void PostMinorCounter() - { - if(m_MinorCounter < m_MajorCounter) - { - ++m_MinorCounter; - } - else if(m_MajorCounter > 0) - { - --m_MajorCounter; - --m_MinorCounter; - } - } -}; - -#endif // _VMA_MAPPING_HYSTERESIS - -#ifndef _VMA_DEVICE_MEMORY_BLOCK -/* -Represents a single block of device memory (`VkDeviceMemory`) with all the -data about its regions (aka suballocations, #VmaAllocation), assigned and free. - -Thread-safety: -- Access to m_pMetadata must be externally synchronized. -- Map, Unmap, Bind* are synchronized internally. -*/ -class VmaDeviceMemoryBlock -{ - VMA_CLASS_NO_COPY(VmaDeviceMemoryBlock) -public: - VmaBlockMetadata* m_pMetadata; - - VmaDeviceMemoryBlock(VmaAllocator hAllocator); - ~VmaDeviceMemoryBlock(); - - // Always call after construction. - void Init( - VmaAllocator hAllocator, - VmaPool hParentPool, - uint32_t newMemoryTypeIndex, - VkDeviceMemory newMemory, - VkDeviceSize newSize, - uint32_t id, - uint32_t algorithm, - VkDeviceSize bufferImageGranularity); - // Always call before destruction. - void Destroy(VmaAllocator allocator); - - VmaPool GetParentPool() const { return m_hParentPool; } - VkDeviceMemory GetDeviceMemory() const { return m_hMemory; } - uint32_t GetMemoryTypeIndex() const { return m_MemoryTypeIndex; } - uint32_t GetId() const { return m_Id; } - void* GetMappedData() const { return m_pMappedData; } - uint32_t GetMapRefCount() const { return m_MapCount; } - - // Call when allocation/free was made from m_pMetadata. - // Used for m_MappingHysteresis. - void PostAlloc() { m_MappingHysteresis.PostAlloc(); } - void PostFree(VmaAllocator hAllocator); - - // Validates all data structures inside this object. If not valid, returns false. - bool Validate() const; - VkResult CheckCorruption(VmaAllocator hAllocator); - - // ppData can be null. 
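The PostMap/PostUnmap/PostAlloc/PostFree comments above describe when the owning block is expected to call into the hysteresis object and what the boolean results mean: true from PostMap() means "keep one extra mapping reference", true from PostFree() means "drop it again". The sketch below is a simplified illustration of that call pattern, not VMA's actual block implementation; MappedBlockSketch and its members are hypothetical.

```
// Hedged sketch: how an owning block could drive the hysteresis counters shown above.
#include <cstdint>

struct MappedBlockSketch
{
    VmaMappingHysteresis hysteresis;
    uint32_t mapRefCount = 0;

    void OnUserMap()
    {
        if (hysteresis.PostMap())
            ++mapRefCount;      // extra +1 kept alive across future unmaps
        ++mapRefCount;          // the user's own reference
        // The real vkMapMemory call happens only when mapRefCount goes 0 -> nonzero.
    }
    void OnUserUnmap()
    {
        hysteresis.PostUnmap();
        --mapRefCount;
        // The real vkUnmapMemory call happens only when mapRefCount reaches 0.
    }
    void OnAllocationMade() { hysteresis.PostAlloc(); }
    void OnAllocationFreed()
    {
        if (hysteresis.PostFree() && mapRefCount > 0)
            --mapRefCount;      // drop the extra reference added earlier
    }
};
```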
- VkResult Map(VmaAllocator hAllocator, uint32_t count, void** ppData); - void Unmap(VmaAllocator hAllocator, uint32_t count); - - VkResult WriteMagicValueAfterAllocation(VmaAllocator hAllocator, VkDeviceSize allocOffset, VkDeviceSize allocSize); - VkResult ValidateMagicValueAfterAllocation(VmaAllocator hAllocator, VkDeviceSize allocOffset, VkDeviceSize allocSize); - - VkResult BindBufferMemory( - const VmaAllocator hAllocator, - const VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkBuffer hBuffer, - const void* pNext); - VkResult BindImageMemory( - const VmaAllocator hAllocator, - const VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkImage hImage, - const void* pNext); - -private: - VmaPool m_hParentPool; // VK_NULL_HANDLE if not belongs to custom pool. - uint32_t m_MemoryTypeIndex; - uint32_t m_Id; - VkDeviceMemory m_hMemory; - - /* - Protects access to m_hMemory so it is not used by multiple threads simultaneously, e.g. vkMapMemory, vkBindBufferMemory. - Also protects m_MapCount, m_pMappedData. - Allocations, deallocations, any change in m_pMetadata is protected by parent's VmaBlockVector::m_Mutex. - */ - VMA_MUTEX m_MapAndBindMutex; - VmaMappingHysteresis m_MappingHysteresis; - uint32_t m_MapCount; - void* m_pMappedData; -}; -#endif // _VMA_DEVICE_MEMORY_BLOCK - -#ifndef _VMA_ALLOCATION_T -struct VmaAllocation_T -{ - friend struct VmaDedicatedAllocationListItemTraits; - - enum FLAGS - { - FLAG_PERSISTENT_MAP = 0x01, - FLAG_MAPPING_ALLOWED = 0x02, - }; - -public: - enum ALLOCATION_TYPE - { - ALLOCATION_TYPE_NONE, - ALLOCATION_TYPE_BLOCK, - ALLOCATION_TYPE_DEDICATED, - }; - - // This struct is allocated using VmaPoolAllocator. - VmaAllocation_T(bool mappingAllowed); - ~VmaAllocation_T(); - - void InitBlockAllocation( - VmaDeviceMemoryBlock* block, - VmaAllocHandle allocHandle, - VkDeviceSize alignment, - VkDeviceSize size, - uint32_t memoryTypeIndex, - VmaSuballocationType suballocationType, - bool mapped); - // pMappedData not null means allocation is created with MAPPED flag. 
- void InitDedicatedAllocation( - VmaPool hParentPool, - uint32_t memoryTypeIndex, - VkDeviceMemory hMemory, - VmaSuballocationType suballocationType, - void* pMappedData, - VkDeviceSize size); - - ALLOCATION_TYPE GetType() const { return (ALLOCATION_TYPE)m_Type; } - VkDeviceSize GetAlignment() const { return m_Alignment; } - VkDeviceSize GetSize() const { return m_Size; } - void* GetUserData() const { return m_pUserData; } - const char* GetName() const { return m_pName; } - VmaSuballocationType GetSuballocationType() const { return (VmaSuballocationType)m_SuballocationType; } - - VmaDeviceMemoryBlock* GetBlock() const { VMA_ASSERT(m_Type == ALLOCATION_TYPE_BLOCK); return m_BlockAllocation.m_Block; } - uint32_t GetMemoryTypeIndex() const { return m_MemoryTypeIndex; } - bool IsPersistentMap() const { return (m_Flags & FLAG_PERSISTENT_MAP) != 0; } - bool IsMappingAllowed() const { return (m_Flags & FLAG_MAPPING_ALLOWED) != 0; } - - void SetUserData(VmaAllocator hAllocator, void* pUserData) { m_pUserData = pUserData; } - void SetName(VmaAllocator hAllocator, const char* pName); - void FreeName(VmaAllocator hAllocator); - uint8_t SwapBlockAllocation(VmaAllocator hAllocator, VmaAllocation allocation); - VmaAllocHandle GetAllocHandle() const; - VkDeviceSize GetOffset() const; - VmaPool GetParentPool() const; - VkDeviceMemory GetMemory() const; - void* GetMappedData() const; - - void BlockAllocMap(); - void BlockAllocUnmap(); - VkResult DedicatedAllocMap(VmaAllocator hAllocator, void** ppData); - void DedicatedAllocUnmap(VmaAllocator hAllocator); - -#if VMA_STATS_STRING_ENABLED - uint32_t GetBufferImageUsage() const { return m_BufferImageUsage; } - - void InitBufferImageUsage(uint32_t bufferImageUsage); - void PrintParameters(class VmaJsonWriter& json) const; -#endif - -private: - // Allocation out of VmaDeviceMemoryBlock. - struct BlockAllocation - { - VmaDeviceMemoryBlock* m_Block; - VmaAllocHandle m_AllocHandle; - }; - // Allocation for an object that has its own private VkDeviceMemory. - struct DedicatedAllocation - { - VmaPool m_hParentPool; // VK_NULL_HANDLE if not belongs to custom pool. - VkDeviceMemory m_hMemory; - void* m_pMappedData; // Not null means memory is mapped. - VmaAllocation_T* m_Prev; - VmaAllocation_T* m_Next; - }; - union - { - // Allocation out of VmaDeviceMemoryBlock. - BlockAllocation m_BlockAllocation; - // Allocation for an object that has its own private VkDeviceMemory. - DedicatedAllocation m_DedicatedAllocation; - }; - - VkDeviceSize m_Alignment; - VkDeviceSize m_Size; - void* m_pUserData; - char* m_pName; - uint32_t m_MemoryTypeIndex; - uint8_t m_Type; // ALLOCATION_TYPE - uint8_t m_SuballocationType; // VmaSuballocationType - // Reference counter for vmaMapMemory()/vmaUnmapMemory(). - uint8_t m_MapCount; - uint8_t m_Flags; // enum FLAGS -#if VMA_STATS_STRING_ENABLED - uint32_t m_BufferImageUsage; // 0 if unknown. 
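GetBlock() above asserts the type tag before touching the union, which is the usual discriminated-union discipline behind VmaAllocation_T's BlockAllocation/DedicatedAllocation members. A minimal standalone sketch of that pattern follows; TaggedAllocationSketch and its members are hypothetical and far smaller than the real struct.

```
// Hedged illustration of the tagged-union pattern used by VmaAllocation_T above:
// a small type tag selects the active union member, and accessors assert the tag
// before touching the corresponding member (all names below are hypothetical).
#include <cassert>
#include <cstdint>

struct TaggedAllocationSketch
{
    enum Type : uint8_t { TYPE_BLOCK, TYPE_DEDICATED };

    struct Block     { void* block;  uint64_t handle; };
    struct Dedicated { void* memory; void* mapped;    };

    union
    {
        Block     m_Block;
        Dedicated m_Dedicated;
    };
    uint8_t m_Type;

    const Block& AsBlock() const
    {
        assert(m_Type == TYPE_BLOCK);       // mirrors VMA_ASSERT in GetBlock()
        return m_Block;
    }
    const Dedicated& AsDedicated() const
    {
        assert(m_Type == TYPE_DEDICATED);
        return m_Dedicated;
    }
};
```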
-#endif -}; -#endif // _VMA_ALLOCATION_T - -#ifndef _VMA_DEDICATED_ALLOCATION_LIST_ITEM_TRAITS -struct VmaDedicatedAllocationListItemTraits -{ - typedef VmaAllocation_T ItemType; - - static ItemType* GetPrev(const ItemType* item) - { - VMA_HEAVY_ASSERT(item->GetType() == VmaAllocation_T::ALLOCATION_TYPE_DEDICATED); - return item->m_DedicatedAllocation.m_Prev; - } - static ItemType* GetNext(const ItemType* item) - { - VMA_HEAVY_ASSERT(item->GetType() == VmaAllocation_T::ALLOCATION_TYPE_DEDICATED); - return item->m_DedicatedAllocation.m_Next; - } - static ItemType*& AccessPrev(ItemType* item) - { - VMA_HEAVY_ASSERT(item->GetType() == VmaAllocation_T::ALLOCATION_TYPE_DEDICATED); - return item->m_DedicatedAllocation.m_Prev; - } - static ItemType*& AccessNext(ItemType* item) - { - VMA_HEAVY_ASSERT(item->GetType() == VmaAllocation_T::ALLOCATION_TYPE_DEDICATED); - return item->m_DedicatedAllocation.m_Next; - } -}; -#endif // _VMA_DEDICATED_ALLOCATION_LIST_ITEM_TRAITS - -#ifndef _VMA_DEDICATED_ALLOCATION_LIST -/* -Stores linked list of VmaAllocation_T objects. -Thread-safe, synchronized internally. -*/ -class VmaDedicatedAllocationList -{ -public: - VmaDedicatedAllocationList() {} - ~VmaDedicatedAllocationList(); - - void Init(bool useMutex) { m_UseMutex = useMutex; } - bool Validate(); - - void AddDetailedStatistics(VmaDetailedStatistics& inoutStats); - void AddStatistics(VmaStatistics& inoutStats); -#if VMA_STATS_STRING_ENABLED - // Writes JSON array with the list of allocations. - void BuildStatsString(VmaJsonWriter& json); -#endif - - bool IsEmpty(); - void Register(VmaAllocation alloc); - void Unregister(VmaAllocation alloc); - -private: - typedef VmaIntrusiveLinkedList DedicatedAllocationLinkedList; - - bool m_UseMutex = true; - VMA_RW_MUTEX m_Mutex; - DedicatedAllocationLinkedList m_AllocationList; -}; - -#ifndef _VMA_DEDICATED_ALLOCATION_LIST_FUNCTIONS - -VmaDedicatedAllocationList::~VmaDedicatedAllocationList() -{ - VMA_HEAVY_ASSERT(Validate()); - - if (!m_AllocationList.IsEmpty()) - { - VMA_ASSERT(false && "Unfreed dedicated allocations found!"); - } -} - -bool VmaDedicatedAllocationList::Validate() -{ - const size_t declaredCount = m_AllocationList.GetCount(); - size_t actualCount = 0; - VmaMutexLockRead lock(m_Mutex, m_UseMutex); - for (VmaAllocation alloc = m_AllocationList.Front(); - alloc != VMA_NULL; alloc = m_AllocationList.GetNext(alloc)) - { - ++actualCount; - } - VMA_VALIDATE(actualCount == declaredCount); - - return true; -} - -void VmaDedicatedAllocationList::AddDetailedStatistics(VmaDetailedStatistics& inoutStats) -{ - for(auto* item = m_AllocationList.Front(); item != nullptr; item = DedicatedAllocationLinkedList::GetNext(item)) - { - const VkDeviceSize size = item->GetSize(); - inoutStats.statistics.blockCount++; - inoutStats.statistics.blockBytes += size; - VmaAddDetailedStatisticsAllocation(inoutStats, item->GetSize()); - } -} - -void VmaDedicatedAllocationList::AddStatistics(VmaStatistics& inoutStats) -{ - VmaMutexLockRead lock(m_Mutex, m_UseMutex); - - const uint32_t allocCount = (uint32_t)m_AllocationList.GetCount(); - inoutStats.blockCount += allocCount; - inoutStats.allocationCount += allocCount; - - for(auto* item = m_AllocationList.Front(); item != nullptr; item = DedicatedAllocationLinkedList::GetNext(item)) - { - const VkDeviceSize size = item->GetSize(); - inoutStats.blockBytes += size; - inoutStats.allocationBytes += size; - } -} - -#if VMA_STATS_STRING_ENABLED -void VmaDedicatedAllocationList::BuildStatsString(VmaJsonWriter& json) -{ - VmaMutexLockRead 
lock(m_Mutex, m_UseMutex); - json.BeginArray(); - for (VmaAllocation alloc = m_AllocationList.Front(); - alloc != VMA_NULL; alloc = m_AllocationList.GetNext(alloc)) - { - json.BeginObject(true); - alloc->PrintParameters(json); - json.EndObject(); - } - json.EndArray(); -} -#endif // VMA_STATS_STRING_ENABLED - -bool VmaDedicatedAllocationList::IsEmpty() -{ - VmaMutexLockRead lock(m_Mutex, m_UseMutex); - return m_AllocationList.IsEmpty(); -} - -void VmaDedicatedAllocationList::Register(VmaAllocation alloc) -{ - VmaMutexLockWrite lock(m_Mutex, m_UseMutex); - m_AllocationList.PushBack(alloc); -} - -void VmaDedicatedAllocationList::Unregister(VmaAllocation alloc) -{ - VmaMutexLockWrite lock(m_Mutex, m_UseMutex); - m_AllocationList.Remove(alloc); -} -#endif // _VMA_DEDICATED_ALLOCATION_LIST_FUNCTIONS -#endif // _VMA_DEDICATED_ALLOCATION_LIST - -#ifndef _VMA_SUBALLOCATION -/* -Represents a region of VmaDeviceMemoryBlock that is either assigned and returned as -allocated memory block or free. -*/ -struct VmaSuballocation -{ - VkDeviceSize offset; - VkDeviceSize size; - void* userData; - VmaSuballocationType type; -}; - -// Comparator for offsets. -struct VmaSuballocationOffsetLess -{ - bool operator()(const VmaSuballocation& lhs, const VmaSuballocation& rhs) const - { - return lhs.offset < rhs.offset; - } -}; - -struct VmaSuballocationOffsetGreater -{ - bool operator()(const VmaSuballocation& lhs, const VmaSuballocation& rhs) const - { - return lhs.offset > rhs.offset; - } -}; - -struct VmaSuballocationItemSizeLess -{ - bool operator()(const VmaSuballocationList::iterator lhs, - const VmaSuballocationList::iterator rhs) const - { - return lhs->size < rhs->size; - } - - bool operator()(const VmaSuballocationList::iterator lhs, - VkDeviceSize rhsSize) const - { - return lhs->size < rhsSize; - } -}; -#endif // _VMA_SUBALLOCATION - -#ifndef _VMA_ALLOCATION_REQUEST -/* -Parameters of planned allocation inside a VmaDeviceMemoryBlock. -item points to a FREE suballocation. -*/ -struct VmaAllocationRequest -{ - VmaAllocHandle allocHandle; - VkDeviceSize size; - VmaSuballocationList::iterator item; - void* customData; - uint64_t algorithmData; - VmaAllocationRequestType type; -}; -#endif // _VMA_ALLOCATION_REQUEST - -#ifndef _VMA_BLOCK_METADATA -/* -Data structure used for bookkeeping of allocations and unused ranges of memory -in a single VkDeviceMemory block. -*/ -class VmaBlockMetadata -{ -public: - // pAllocationCallbacks, if not null, must be owned externally - alive and unchanged for the whole lifetime of this object. - VmaBlockMetadata(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual); - virtual ~VmaBlockMetadata() = default; - - virtual void Init(VkDeviceSize size) { m_Size = size; } - bool IsVirtual() const { return m_IsVirtual; } - VkDeviceSize GetSize() const { return m_Size; } - - // Validates all data structures inside this object. If not valid, returns false. - virtual bool Validate() const = 0; - virtual size_t GetAllocationCount() const = 0; - virtual size_t GetFreeRegionsCount() const = 0; - virtual VkDeviceSize GetSumFreeSize() const = 0; - // Returns true if this block is empty - contains only single free suballocation. 
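VmaSuballocationItemSizeLess above has an extra overload taking a bare VkDeviceSize so that a size-sorted array of free regions can be binary-searched for the first region that is large enough (best-fit). A minimal sketch of that lookup follows, using std::vector and std::lower_bound for brevity where VMA itself uses VmaVector and VmaBinaryFindFirstNotLess; FreeRegionSketch and FindBestFit are hypothetical names.

```
// Hedged sketch of the best-fit lookup enabled by keeping free regions sorted by size.
#include <algorithm>
#include <vector>

struct FreeRegionSketch { VkDeviceSize offset; VkDeviceSize size; };

FreeRegionSketch* FindBestFit(std::vector<FreeRegionSketch>& freeBySize, VkDeviceSize wanted)
{
    // Heterogeneous comparator (region vs. plain size), like the overload above.
    auto it = std::lower_bound(
        freeBySize.begin(), freeBySize.end(), wanted,
        [](const FreeRegionSketch& region, VkDeviceSize size) { return region.size < size; });
    return it != freeBySize.end() ? &*it : nullptr;  // smallest region with size >= wanted
}
```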
- virtual bool IsEmpty() const = 0; - virtual void GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) = 0; - virtual VkDeviceSize GetAllocationOffset(VmaAllocHandle allocHandle) const = 0; - virtual void* GetAllocationUserData(VmaAllocHandle allocHandle) const = 0; - - virtual VmaAllocHandle GetAllocationListBegin() const = 0; - virtual VmaAllocHandle GetNextAllocation(VmaAllocHandle prevAlloc) const = 0; - virtual VkDeviceSize GetNextFreeRegionSize(VmaAllocHandle alloc) const = 0; - - // Shouldn't modify blockCount. - virtual void AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const = 0; - virtual void AddStatistics(VmaStatistics& inoutStats) const = 0; - -#if VMA_STATS_STRING_ENABLED - virtual void PrintDetailedMap(class VmaJsonWriter& json) const = 0; -#endif - - // Tries to find a place for suballocation with given parameters inside this block. - // If succeeded, fills pAllocationRequest and returns true. - // If failed, returns false. - virtual bool CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - // Always one of VMA_ALLOCATION_CREATE_STRATEGY_* or VMA_ALLOCATION_INTERNAL_STRATEGY_* flags. - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) = 0; - - virtual VkResult CheckCorruption(const void* pBlockData) = 0; - - // Makes actual allocation based on request. Request must already be checked and valid. - virtual void Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) = 0; - - // Frees suballocation assigned to given memory region. - virtual void Free(VmaAllocHandle allocHandle) = 0; - - // Frees all allocations. - // Careful! Don't call it if there are VmaAllocation objects owned by userData of cleared allocations! - virtual void Clear() = 0; - - virtual void SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) = 0; - virtual void DebugLogAllAllocations() const = 0; - -protected: - const VkAllocationCallbacks* GetAllocationCallbacks() const { return m_pAllocationCallbacks; } - VkDeviceSize GetBufferImageGranularity() const { return m_BufferImageGranularity; } - VkDeviceSize GetDebugMargin() const { return IsVirtual() ? 0 : VMA_DEBUG_MARGIN; } - - void DebugLogAllocation(VkDeviceSize offset, VkDeviceSize size, void* userData) const; -#if VMA_STATS_STRING_ENABLED - // mapRefCount == UINT32_MAX means unspecified. 
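The comments on CreateAllocationRequest and Alloc above define a two-phase protocol: probe first, commit only if the probe succeeded, and eventually Free the returned handle. A hedged sketch of that sequence against any concrete metadata subclass follows; TryAllocSketch is a hypothetical helper, and the strategy flag is just one of the VMA_ALLOCATION_CREATE_STRATEGY_* values the comment mentions.

```
// Hedged sketch of the two-phase protocol documented above.
bool TryAllocSketch(VmaBlockMetadata& metadata,
    VkDeviceSize size, VkDeviceSize alignment, void* userData, VmaAllocHandle& outHandle)
{
    VmaAllocationRequest request = {};
    if (!metadata.CreateAllocationRequest(
            size, alignment,
            false,                                        // upperAddress
            VMA_SUBALLOCATION_TYPE_UNKNOWN,
            VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT,
            &request))
    {
        return false;                                     // does not fit in this block
    }
    metadata.Alloc(request, VMA_SUBALLOCATION_TYPE_UNKNOWN, userData);
    outHandle = request.allocHandle;
    return true;
}

// ...later, to release the region:
// metadata.Free(outHandle);
```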
- void PrintDetailedMap_Begin(class VmaJsonWriter& json, - VkDeviceSize unusedBytes, - size_t allocationCount, - size_t unusedRangeCount) const; - void PrintDetailedMap_Allocation(class VmaJsonWriter& json, - VkDeviceSize offset, VkDeviceSize size, void* userData) const; - void PrintDetailedMap_UnusedRange(class VmaJsonWriter& json, - VkDeviceSize offset, - VkDeviceSize size) const; - void PrintDetailedMap_End(class VmaJsonWriter& json) const; -#endif - -private: - VkDeviceSize m_Size; - const VkAllocationCallbacks* m_pAllocationCallbacks; - const VkDeviceSize m_BufferImageGranularity; - const bool m_IsVirtual; -}; - -#ifndef _VMA_BLOCK_METADATA_FUNCTIONS -VmaBlockMetadata::VmaBlockMetadata(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual) - : m_Size(0), - m_pAllocationCallbacks(pAllocationCallbacks), - m_BufferImageGranularity(bufferImageGranularity), - m_IsVirtual(isVirtual) {} - -void VmaBlockMetadata::DebugLogAllocation(VkDeviceSize offset, VkDeviceSize size, void* userData) const -{ - if (IsVirtual()) - { - VMA_DEBUG_LOG("UNFREED VIRTUAL ALLOCATION; Offset: %llu; Size: %llu; UserData: %p", offset, size, userData); - } - else - { - VMA_ASSERT(userData != VMA_NULL); - VmaAllocation allocation = reinterpret_cast(userData); - - userData = allocation->GetUserData(); - const char* name = allocation->GetName(); - -#if VMA_STATS_STRING_ENABLED - VMA_DEBUG_LOG("UNFREED ALLOCATION; Offset: %llu; Size: %llu; UserData: %p; Name: %s; Type: %s; Usage: %u", - offset, size, userData, name ? name : "vma_empty", - VMA_SUBALLOCATION_TYPE_NAMES[allocation->GetSuballocationType()], - allocation->GetBufferImageUsage()); -#else - VMA_DEBUG_LOG("UNFREED ALLOCATION; Offset: %llu; Size: %llu; UserData: %p; Name: %s; Type: %u", - offset, size, userData, name ? 
name : "vma_empty", - (uint32_t)allocation->GetSuballocationType()); -#endif // VMA_STATS_STRING_ENABLED - } - -} - -#if VMA_STATS_STRING_ENABLED -void VmaBlockMetadata::PrintDetailedMap_Begin(class VmaJsonWriter& json, - VkDeviceSize unusedBytes, size_t allocationCount, size_t unusedRangeCount) const -{ - json.WriteString("TotalBytes"); - json.WriteNumber(GetSize()); - - json.WriteString("UnusedBytes"); - json.WriteSize(unusedBytes); - - json.WriteString("Allocations"); - json.WriteSize(allocationCount); - - json.WriteString("UnusedRanges"); - json.WriteSize(unusedRangeCount); - - json.WriteString("Suballocations"); - json.BeginArray(); -} - -void VmaBlockMetadata::PrintDetailedMap_Allocation(class VmaJsonWriter& json, - VkDeviceSize offset, VkDeviceSize size, void* userData) const -{ - json.BeginObject(true); - - json.WriteString("Offset"); - json.WriteNumber(offset); - - if (IsVirtual()) - { - json.WriteString("Size"); - json.WriteNumber(size); - if (userData) - { - json.WriteString("CustomData"); - json.BeginString(); - json.ContinueString_Pointer(userData); - json.EndString(); - } - } - else - { - ((VmaAllocation)userData)->PrintParameters(json); - } - - json.EndObject(); -} - -void VmaBlockMetadata::PrintDetailedMap_UnusedRange(class VmaJsonWriter& json, - VkDeviceSize offset, VkDeviceSize size) const -{ - json.BeginObject(true); - - json.WriteString("Offset"); - json.WriteNumber(offset); - - json.WriteString("Type"); - json.WriteString(VMA_SUBALLOCATION_TYPE_NAMES[VMA_SUBALLOCATION_TYPE_FREE]); - - json.WriteString("Size"); - json.WriteNumber(size); - - json.EndObject(); -} - -void VmaBlockMetadata::PrintDetailedMap_End(class VmaJsonWriter& json) const -{ - json.EndArray(); -} -#endif // VMA_STATS_STRING_ENABLED -#endif // _VMA_BLOCK_METADATA_FUNCTIONS -#endif // _VMA_BLOCK_METADATA - -#ifndef _VMA_BLOCK_BUFFER_IMAGE_GRANULARITY -// Before deleting object of this class remember to call 'Destroy()' -class VmaBlockBufferImageGranularity final -{ -public: - struct ValidationContext - { - const VkAllocationCallbacks* allocCallbacks; - uint16_t* pageAllocs; - }; - - VmaBlockBufferImageGranularity(VkDeviceSize bufferImageGranularity); - ~VmaBlockBufferImageGranularity(); - - bool IsEnabled() const { return m_BufferImageGranularity > MAX_LOW_BUFFER_IMAGE_GRANULARITY; } - - void Init(const VkAllocationCallbacks* pAllocationCallbacks, VkDeviceSize size); - // Before destroying object you must call free it's memory - void Destroy(const VkAllocationCallbacks* pAllocationCallbacks); - - void RoundupAllocRequest(VmaSuballocationType allocType, - VkDeviceSize& inOutAllocSize, - VkDeviceSize& inOutAllocAlignment) const; - - bool CheckConflictAndAlignUp(VkDeviceSize& inOutAllocOffset, - VkDeviceSize allocSize, - VkDeviceSize blockOffset, - VkDeviceSize blockSize, - VmaSuballocationType allocType) const; - - void AllocPages(uint8_t allocType, VkDeviceSize offset, VkDeviceSize size); - void FreePages(VkDeviceSize offset, VkDeviceSize size); - void Clear(); - - ValidationContext StartValidation(const VkAllocationCallbacks* pAllocationCallbacks, - bool isVirutal) const; - bool Validate(ValidationContext& ctx, VkDeviceSize offset, VkDeviceSize size) const; - bool FinishValidation(ValidationContext& ctx) const; - -private: - static const uint16_t MAX_LOW_BUFFER_IMAGE_GRANULARITY = 256; - - struct RegionInfo - { - uint8_t allocType; - uint16_t allocCount; - }; - - VkDeviceSize m_BufferImageGranularity; - uint32_t m_RegionCount; - RegionInfo* m_RegionInfo; - - uint32_t GetStartPage(VkDeviceSize offset) 
const { return OffsetToPageIndex(offset & ~(m_BufferImageGranularity - 1)); } - uint32_t GetEndPage(VkDeviceSize offset, VkDeviceSize size) const { return OffsetToPageIndex((offset + size - 1) & ~(m_BufferImageGranularity - 1)); } - - uint32_t OffsetToPageIndex(VkDeviceSize offset) const; - void AllocPage(RegionInfo& page, uint8_t allocType); -}; - -#ifndef _VMA_BLOCK_BUFFER_IMAGE_GRANULARITY_FUNCTIONS -VmaBlockBufferImageGranularity::VmaBlockBufferImageGranularity(VkDeviceSize bufferImageGranularity) - : m_BufferImageGranularity(bufferImageGranularity), - m_RegionCount(0), - m_RegionInfo(VMA_NULL) {} - -VmaBlockBufferImageGranularity::~VmaBlockBufferImageGranularity() -{ - VMA_ASSERT(m_RegionInfo == VMA_NULL && "Free not called before destroying object!"); -} - -void VmaBlockBufferImageGranularity::Init(const VkAllocationCallbacks* pAllocationCallbacks, VkDeviceSize size) -{ - if (IsEnabled()) - { - m_RegionCount = static_cast(VmaDivideRoundingUp(size, m_BufferImageGranularity)); - m_RegionInfo = vma_new_array(pAllocationCallbacks, RegionInfo, m_RegionCount); - memset(m_RegionInfo, 0, m_RegionCount * sizeof(RegionInfo)); - } -} - -void VmaBlockBufferImageGranularity::Destroy(const VkAllocationCallbacks* pAllocationCallbacks) -{ - if (m_RegionInfo) - { - vma_delete_array(pAllocationCallbacks, m_RegionInfo, m_RegionCount); - m_RegionInfo = VMA_NULL; - } -} - -void VmaBlockBufferImageGranularity::RoundupAllocRequest(VmaSuballocationType allocType, - VkDeviceSize& inOutAllocSize, - VkDeviceSize& inOutAllocAlignment) const -{ - if (m_BufferImageGranularity > 1 && - m_BufferImageGranularity <= MAX_LOW_BUFFER_IMAGE_GRANULARITY) - { - if (allocType == VMA_SUBALLOCATION_TYPE_UNKNOWN || - allocType == VMA_SUBALLOCATION_TYPE_IMAGE_UNKNOWN || - allocType == VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL) - { - inOutAllocAlignment = VMA_MAX(inOutAllocAlignment, m_BufferImageGranularity); - inOutAllocSize = VmaAlignUp(inOutAllocSize, m_BufferImageGranularity); - } - } -} - -bool VmaBlockBufferImageGranularity::CheckConflictAndAlignUp(VkDeviceSize& inOutAllocOffset, - VkDeviceSize allocSize, - VkDeviceSize blockOffset, - VkDeviceSize blockSize, - VmaSuballocationType allocType) const -{ - if (IsEnabled()) - { - uint32_t startPage = GetStartPage(inOutAllocOffset); - if (m_RegionInfo[startPage].allocCount > 0 && - VmaIsBufferImageGranularityConflict(static_cast(m_RegionInfo[startPage].allocType), allocType)) - { - inOutAllocOffset = VmaAlignUp(inOutAllocOffset, m_BufferImageGranularity); - if (blockSize < allocSize + inOutAllocOffset - blockOffset) - return true; - ++startPage; - } - uint32_t endPage = GetEndPage(inOutAllocOffset, allocSize); - if (endPage != startPage && - m_RegionInfo[endPage].allocCount > 0 && - VmaIsBufferImageGranularityConflict(static_cast(m_RegionInfo[endPage].allocType), allocType)) - { - return true; - } - } - return false; -} - -void VmaBlockBufferImageGranularity::AllocPages(uint8_t allocType, VkDeviceSize offset, VkDeviceSize size) -{ - if (IsEnabled()) - { - uint32_t startPage = GetStartPage(offset); - AllocPage(m_RegionInfo[startPage], allocType); - - uint32_t endPage = GetEndPage(offset, size); - if (startPage != endPage) - AllocPage(m_RegionInfo[endPage], allocType); - } -} - -void VmaBlockBufferImageGranularity::FreePages(VkDeviceSize offset, VkDeviceSize size) -{ - if (IsEnabled()) - { - uint32_t startPage = GetStartPage(offset); - --m_RegionInfo[startPage].allocCount; - if (m_RegionInfo[startPage].allocCount == 0) - m_RegionInfo[startPage].allocType = VMA_SUBALLOCATION_TYPE_FREE; 
- uint32_t endPage = GetEndPage(offset, size); - if (startPage != endPage) - { - --m_RegionInfo[endPage].allocCount; - if (m_RegionInfo[endPage].allocCount == 0) - m_RegionInfo[endPage].allocType = VMA_SUBALLOCATION_TYPE_FREE; - } - } -} - -void VmaBlockBufferImageGranularity::Clear() -{ - if (m_RegionInfo) - memset(m_RegionInfo, 0, m_RegionCount * sizeof(RegionInfo)); -} - -VmaBlockBufferImageGranularity::ValidationContext VmaBlockBufferImageGranularity::StartValidation( - const VkAllocationCallbacks* pAllocationCallbacks, bool isVirutal) const -{ - ValidationContext ctx{ pAllocationCallbacks, VMA_NULL }; - if (!isVirutal && IsEnabled()) - { - ctx.pageAllocs = vma_new_array(pAllocationCallbacks, uint16_t, m_RegionCount); - memset(ctx.pageAllocs, 0, m_RegionCount * sizeof(uint16_t)); - } - return ctx; -} - -bool VmaBlockBufferImageGranularity::Validate(ValidationContext& ctx, - VkDeviceSize offset, VkDeviceSize size) const -{ - if (IsEnabled()) - { - uint32_t start = GetStartPage(offset); - ++ctx.pageAllocs[start]; - VMA_VALIDATE(m_RegionInfo[start].allocCount > 0); - - uint32_t end = GetEndPage(offset, size); - if (start != end) - { - ++ctx.pageAllocs[end]; - VMA_VALIDATE(m_RegionInfo[end].allocCount > 0); - } - } - return true; -} - -bool VmaBlockBufferImageGranularity::FinishValidation(ValidationContext& ctx) const -{ - // Check proper page structure - if (IsEnabled()) - { - VMA_ASSERT(ctx.pageAllocs != VMA_NULL && "Validation context not initialized!"); - - for (uint32_t page = 0; page < m_RegionCount; ++page) - { - VMA_VALIDATE(ctx.pageAllocs[page] == m_RegionInfo[page].allocCount); - } - vma_delete_array(ctx.allocCallbacks, ctx.pageAllocs, m_RegionCount); - ctx.pageAllocs = VMA_NULL; - } - return true; -} - -uint32_t VmaBlockBufferImageGranularity::OffsetToPageIndex(VkDeviceSize offset) const -{ - return static_cast(offset >> VMA_BITSCAN_MSB(m_BufferImageGranularity)); -} - -void VmaBlockBufferImageGranularity::AllocPage(RegionInfo& page, uint8_t allocType) -{ - // When current alloc type is free then it can be overriden by new type - if (page.allocCount == 0 || (page.allocCount > 0 && page.allocType == VMA_SUBALLOCATION_TYPE_FREE)) - page.allocType = allocType; - - ++page.allocCount; -} -#endif // _VMA_BLOCK_BUFFER_IMAGE_GRANULARITY_FUNCTIONS -#endif // _VMA_BLOCK_BUFFER_IMAGE_GRANULARITY - -#if 0 -#ifndef _VMA_BLOCK_METADATA_GENERIC -class VmaBlockMetadata_Generic : public VmaBlockMetadata -{ - friend class VmaDefragmentationAlgorithm_Generic; - friend class VmaDefragmentationAlgorithm_Fast; - VMA_CLASS_NO_COPY(VmaBlockMetadata_Generic) -public: - VmaBlockMetadata_Generic(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual); - virtual ~VmaBlockMetadata_Generic() = default; - - size_t GetAllocationCount() const override { return m_Suballocations.size() - m_FreeCount; } - VkDeviceSize GetSumFreeSize() const override { return m_SumFreeSize; } - bool IsEmpty() const override { return (m_Suballocations.size() == 1) && (m_FreeCount == 1); } - void Free(VmaAllocHandle allocHandle) override { FreeSuballocation(FindAtOffset((VkDeviceSize)allocHandle - 1)); } - VkDeviceSize GetAllocationOffset(VmaAllocHandle allocHandle) const override { return (VkDeviceSize)allocHandle - 1; }; - - void Init(VkDeviceSize size) override; - bool Validate() const override; - - void AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const override; - void AddStatistics(VmaStatistics& inoutStats) const override; - -#if VMA_STATS_STRING_ENABLED - void 
PrintDetailedMap(class VmaJsonWriter& json, uint32_t mapRefCount) const override; -#endif - - bool CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) override; - - VkResult CheckCorruption(const void* pBlockData) override; - - void Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) override; - - void GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) override; - void* GetAllocationUserData(VmaAllocHandle allocHandle) const override; - VmaAllocHandle GetAllocationListBegin() const override; - VmaAllocHandle GetNextAllocation(VmaAllocHandle prevAlloc) const override; - void Clear() override; - void SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) override; - void DebugLogAllAllocations() const override; - -private: - uint32_t m_FreeCount; - VkDeviceSize m_SumFreeSize; - VmaSuballocationList m_Suballocations; - // Suballocations that are free. Sorted by size, ascending. - VmaVector> m_FreeSuballocationsBySize; - - VkDeviceSize AlignAllocationSize(VkDeviceSize size) const { return IsVirtual() ? size : VmaAlignUp(size, (VkDeviceSize)16); } - - VmaSuballocationList::iterator FindAtOffset(VkDeviceSize offset) const; - bool ValidateFreeSuballocationList() const; - - // Checks if requested suballocation with given parameters can be placed in given pFreeSuballocItem. - // If yes, fills pOffset and returns true. If no, returns false. - bool CheckAllocation( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - VmaSuballocationList::const_iterator suballocItem, - VmaAllocHandle* pAllocHandle) const; - - // Given free suballocation, it merges it with following one, which must also be free. - void MergeFreeWithNext(VmaSuballocationList::iterator item); - // Releases given suballocation, making it free. - // Merges it with adjacent free suballocations if applicable. - // Returns iterator to new free suballocation at this place. - VmaSuballocationList::iterator FreeSuballocation(VmaSuballocationList::iterator suballocItem); - // Given free suballocation, it inserts it into sorted list of - // m_FreeSuballocationsBySize if it is suitable. - void RegisterFreeSuballocation(VmaSuballocationList::iterator item); - // Given free suballocation, it removes it from sorted list of - // m_FreeSuballocationsBySize if it is suitable. 
- void UnregisterFreeSuballocation(VmaSuballocationList::iterator item); -}; - -#ifndef _VMA_BLOCK_METADATA_GENERIC_FUNCTIONS -VmaBlockMetadata_Generic::VmaBlockMetadata_Generic(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual) - : VmaBlockMetadata(pAllocationCallbacks, bufferImageGranularity, isVirtual), - m_FreeCount(0), - m_SumFreeSize(0), - m_Suballocations(VmaStlAllocator(pAllocationCallbacks)), - m_FreeSuballocationsBySize(VmaStlAllocator(pAllocationCallbacks)) {} - -void VmaBlockMetadata_Generic::Init(VkDeviceSize size) -{ - VmaBlockMetadata::Init(size); - - m_FreeCount = 1; - m_SumFreeSize = size; - - VmaSuballocation suballoc = {}; - suballoc.offset = 0; - suballoc.size = size; - suballoc.type = VMA_SUBALLOCATION_TYPE_FREE; - - m_Suballocations.push_back(suballoc); - m_FreeSuballocationsBySize.push_back(m_Suballocations.begin()); -} - -bool VmaBlockMetadata_Generic::Validate() const -{ - VMA_VALIDATE(!m_Suballocations.empty()); - - // Expected offset of new suballocation as calculated from previous ones. - VkDeviceSize calculatedOffset = 0; - // Expected number of free suballocations as calculated from traversing their list. - uint32_t calculatedFreeCount = 0; - // Expected sum size of free suballocations as calculated from traversing their list. - VkDeviceSize calculatedSumFreeSize = 0; - // Expected number of free suballocations that should be registered in - // m_FreeSuballocationsBySize calculated from traversing their list. - size_t freeSuballocationsToRegister = 0; - // True if previous visited suballocation was free. - bool prevFree = false; - - const VkDeviceSize debugMargin = GetDebugMargin(); - - for (const auto& subAlloc : m_Suballocations) - { - // Actual offset of this suballocation doesn't match expected one. - VMA_VALIDATE(subAlloc.offset == calculatedOffset); - - const bool currFree = (subAlloc.type == VMA_SUBALLOCATION_TYPE_FREE); - // Two adjacent free suballocations are invalid. They should be merged. - VMA_VALIDATE(!prevFree || !currFree); - - VmaAllocation alloc = (VmaAllocation)subAlloc.userData; - if (!IsVirtual()) - { - VMA_VALIDATE(currFree == (alloc == VK_NULL_HANDLE)); - } - - if (currFree) - { - calculatedSumFreeSize += subAlloc.size; - ++calculatedFreeCount; - ++freeSuballocationsToRegister; - - // Margin required between allocations - every free space must be at least that large. - VMA_VALIDATE(subAlloc.size >= debugMargin); - } - else - { - if (!IsVirtual()) - { - VMA_VALIDATE((VkDeviceSize)alloc->GetAllocHandle() == subAlloc.offset + 1); - VMA_VALIDATE(alloc->GetSize() == subAlloc.size); - } - - // Margin required between allocations - previous allocation must be free. - VMA_VALIDATE(debugMargin == 0 || prevFree); - } - - calculatedOffset += subAlloc.size; - prevFree = currFree; - } - - // Number of free suballocations registered in m_FreeSuballocationsBySize doesn't - // match expected one. - VMA_VALIDATE(m_FreeSuballocationsBySize.size() == freeSuballocationsToRegister); - - VkDeviceSize lastSize = 0; - for (size_t i = 0; i < m_FreeSuballocationsBySize.size(); ++i) - { - VmaSuballocationList::iterator suballocItem = m_FreeSuballocationsBySize[i]; - - // Only free suballocations can be registered in m_FreeSuballocationsBySize. - VMA_VALIDATE(suballocItem->type == VMA_SUBALLOCATION_TYPE_FREE); - // They must be sorted by size ascending. - VMA_VALIDATE(suballocItem->size >= lastSize); - - lastSize = suballocItem->size; - } - - // Check if totals match calculated values. 
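// The walk above recomputed offsets, the number of free suballocations and the
// sum of their sizes from scratch; the checks below verify that these agree
// with the cached members (m_FreeCount, m_SumFreeSize) and that the
// suballocations exactly cover the block (calculatedOffset == GetSize()).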
- VMA_VALIDATE(ValidateFreeSuballocationList()); - VMA_VALIDATE(calculatedOffset == GetSize()); - VMA_VALIDATE(calculatedSumFreeSize == m_SumFreeSize); - VMA_VALIDATE(calculatedFreeCount == m_FreeCount); - - return true; -} - -void VmaBlockMetadata_Generic::AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const -{ - const uint32_t rangeCount = (uint32_t)m_Suballocations.size(); - inoutStats.statistics.blockCount++; - inoutStats.statistics.blockBytes += GetSize(); - - for (const auto& suballoc : m_Suballocations) - { - if (suballoc.type != VMA_SUBALLOCATION_TYPE_FREE) - VmaAddDetailedStatisticsAllocation(inoutStats, suballoc.size); - else - VmaAddDetailedStatisticsUnusedRange(inoutStats, suballoc.size); - } -} - -void VmaBlockMetadata_Generic::AddStatistics(VmaStatistics& inoutStats) const -{ - inoutStats.blockCount++; - inoutStats.allocationCount += (uint32_t)m_Suballocations.size() - m_FreeCount; - inoutStats.blockBytes += GetSize(); - inoutStats.allocationBytes += GetSize() - m_SumFreeSize; -} - -#if VMA_STATS_STRING_ENABLED -void VmaBlockMetadata_Generic::PrintDetailedMap(class VmaJsonWriter& json, uint32_t mapRefCount) const -{ - PrintDetailedMap_Begin(json, - m_SumFreeSize, // unusedBytes - m_Suballocations.size() - (size_t)m_FreeCount, // allocationCount - m_FreeCount, // unusedRangeCount - mapRefCount); - - for (const auto& suballoc : m_Suballocations) - { - if (suballoc.type == VMA_SUBALLOCATION_TYPE_FREE) - { - PrintDetailedMap_UnusedRange(json, suballoc.offset, suballoc.size); - } - else - { - PrintDetailedMap_Allocation(json, suballoc.offset, suballoc.size, suballoc.userData); - } - } - - PrintDetailedMap_End(json); -} -#endif // VMA_STATS_STRING_ENABLED - -bool VmaBlockMetadata_Generic::CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) -{ - VMA_ASSERT(allocSize > 0); - VMA_ASSERT(!upperAddress); - VMA_ASSERT(allocType != VMA_SUBALLOCATION_TYPE_FREE); - VMA_ASSERT(pAllocationRequest != VMA_NULL); - VMA_HEAVY_ASSERT(Validate()); - - allocSize = AlignAllocationSize(allocSize); - - pAllocationRequest->type = VmaAllocationRequestType::Normal; - pAllocationRequest->size = allocSize; - - const VkDeviceSize debugMargin = GetDebugMargin(); - - // There is not enough total free space in this block to fulfill the request: Early return. - if (m_SumFreeSize < allocSize + debugMargin) - { - return false; - } - - // New algorithm, efficiently searching freeSuballocationsBySize. - const size_t freeSuballocCount = m_FreeSuballocationsBySize.size(); - if (freeSuballocCount > 0) - { - if (strategy == 0 || - strategy == VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT) - { - // Find first free suballocation with size not less than allocSize + debugMargin. 
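// m_FreeSuballocationsBySize is sorted by size ascending, so this is
// effectively a lower_bound on (allocSize + debugMargin): the first candidate
// that is large enough by size. The loop below then walks forward through the
// remaining, equally large or larger, candidates until one also passes the
// alignment and buffer-image-granularity checks in CheckAllocation().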
- VmaSuballocationList::iterator* const it = VmaBinaryFindFirstNotLess( - m_FreeSuballocationsBySize.data(), - m_FreeSuballocationsBySize.data() + freeSuballocCount, - allocSize + debugMargin, - VmaSuballocationItemSizeLess()); - size_t index = it - m_FreeSuballocationsBySize.data(); - for (; index < freeSuballocCount; ++index) - { - if (CheckAllocation( - allocSize, - allocAlignment, - allocType, - m_FreeSuballocationsBySize[index], - &pAllocationRequest->allocHandle)) - { - pAllocationRequest->item = m_FreeSuballocationsBySize[index]; - return true; - } - } - } - else if (strategy == VMA_ALLOCATION_INTERNAL_STRATEGY_MIN_OFFSET) - { - for (VmaSuballocationList::iterator it = m_Suballocations.begin(); - it != m_Suballocations.end(); - ++it) - { - if (it->type == VMA_SUBALLOCATION_TYPE_FREE && CheckAllocation( - allocSize, - allocAlignment, - allocType, - it, - &pAllocationRequest->allocHandle)) - { - pAllocationRequest->item = it; - return true; - } - } - } - else - { - VMA_ASSERT(strategy & (VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT | VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT )); - // Search staring from biggest suballocations. - for (size_t index = freeSuballocCount; index--; ) - { - if (CheckAllocation( - allocSize, - allocAlignment, - allocType, - m_FreeSuballocationsBySize[index], - &pAllocationRequest->allocHandle)) - { - pAllocationRequest->item = m_FreeSuballocationsBySize[index]; - return true; - } - } - } - } - - return false; -} - -VkResult VmaBlockMetadata_Generic::CheckCorruption(const void* pBlockData) -{ - for (auto& suballoc : m_Suballocations) - { - if (suballoc.type != VMA_SUBALLOCATION_TYPE_FREE) - { - if (!VmaValidateMagicValue(pBlockData, suballoc.offset + suballoc.size)) - { - VMA_ASSERT(0 && "MEMORY CORRUPTION DETECTED AFTER VALIDATED ALLOCATION!"); - return VK_ERROR_UNKNOWN_COPY; - } - } - } - - return VK_SUCCESS; -} - -void VmaBlockMetadata_Generic::Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) -{ - VMA_ASSERT(request.type == VmaAllocationRequestType::Normal); - VMA_ASSERT(request.item != m_Suballocations.end()); - VmaSuballocation& suballoc = *request.item; - // Given suballocation is a free block. - VMA_ASSERT(suballoc.type == VMA_SUBALLOCATION_TYPE_FREE); - - // Given offset is inside this suballocation. - VMA_ASSERT((VkDeviceSize)request.allocHandle - 1 >= suballoc.offset); - const VkDeviceSize paddingBegin = (VkDeviceSize)request.allocHandle - suballoc.offset - 1; - VMA_ASSERT(suballoc.size >= paddingBegin + request.size); - const VkDeviceSize paddingEnd = suballoc.size - paddingBegin - request.size; - - // Unregister this free suballocation from m_FreeSuballocationsBySize and update - // it to become used. - UnregisterFreeSuballocation(request.item); - - suballoc.offset = (VkDeviceSize)request.allocHandle - 1; - suballoc.size = request.size; - suballoc.type = type; - suballoc.userData = userData; - - // If there are any free bytes remaining at the end, insert new free suballocation after current one. 
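// The chosen free suballocation is split into up to three pieces:
// [paddingBegin][requested allocation][paddingEnd]. paddingBegin comes from
// the debug-margin/alignment adjustments made in CheckAllocation(); whatever
// is left over on either side is re-inserted below as new free suballocations
// so that no space is lost.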
- if (paddingEnd) - { - VmaSuballocation paddingSuballoc = {}; - paddingSuballoc.offset = suballoc.offset + suballoc.size; - paddingSuballoc.size = paddingEnd; - paddingSuballoc.type = VMA_SUBALLOCATION_TYPE_FREE; - VmaSuballocationList::iterator next = request.item; - ++next; - const VmaSuballocationList::iterator paddingEndItem = - m_Suballocations.insert(next, paddingSuballoc); - RegisterFreeSuballocation(paddingEndItem); - } - - // If there are any free bytes remaining at the beginning, insert new free suballocation before current one. - if (paddingBegin) - { - VmaSuballocation paddingSuballoc = {}; - paddingSuballoc.offset = suballoc.offset - paddingBegin; - paddingSuballoc.size = paddingBegin; - paddingSuballoc.type = VMA_SUBALLOCATION_TYPE_FREE; - const VmaSuballocationList::iterator paddingBeginItem = - m_Suballocations.insert(request.item, paddingSuballoc); - RegisterFreeSuballocation(paddingBeginItem); - } - - // Update totals. - m_FreeCount = m_FreeCount - 1; - if (paddingBegin > 0) - { - ++m_FreeCount; - } - if (paddingEnd > 0) - { - ++m_FreeCount; - } - m_SumFreeSize -= request.size; -} - -void VmaBlockMetadata_Generic::GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) -{ - outInfo.offset = (VkDeviceSize)allocHandle - 1; - const VmaSuballocation& suballoc = *FindAtOffset(outInfo.offset); - outInfo.size = suballoc.size; - outInfo.pUserData = suballoc.userData; -} - -void* VmaBlockMetadata_Generic::GetAllocationUserData(VmaAllocHandle allocHandle) const -{ - return FindAtOffset((VkDeviceSize)allocHandle - 1)->userData; -} - -VmaAllocHandle VmaBlockMetadata_Generic::GetAllocationListBegin() const -{ - if (IsEmpty()) - return VK_NULL_HANDLE; - - for (const auto& suballoc : m_Suballocations) - { - if (suballoc.type != VMA_SUBALLOCATION_TYPE_FREE) - return (VmaAllocHandle)(suballoc.offset + 1); - } - VMA_ASSERT(false && "Should contain at least 1 allocation!"); - return VK_NULL_HANDLE; -} - -VmaAllocHandle VmaBlockMetadata_Generic::GetNextAllocation(VmaAllocHandle prevAlloc) const -{ - VmaSuballocationList::const_iterator prev = FindAtOffset((VkDeviceSize)prevAlloc - 1); - - for (VmaSuballocationList::const_iterator it = ++prev; it != m_Suballocations.end(); ++it) - { - if (it->type != VMA_SUBALLOCATION_TYPE_FREE) - return (VmaAllocHandle)(it->offset + 1); - } - return VK_NULL_HANDLE; -} - -void VmaBlockMetadata_Generic::Clear() -{ - const VkDeviceSize size = GetSize(); - - VMA_ASSERT(IsVirtual()); - m_FreeCount = 1; - m_SumFreeSize = size; - m_Suballocations.clear(); - m_FreeSuballocationsBySize.clear(); - - VmaSuballocation suballoc = {}; - suballoc.offset = 0; - suballoc.size = size; - suballoc.type = VMA_SUBALLOCATION_TYPE_FREE; - m_Suballocations.push_back(suballoc); - - m_FreeSuballocationsBySize.push_back(m_Suballocations.begin()); -} - -void VmaBlockMetadata_Generic::SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) -{ - VmaSuballocation& suballoc = *FindAtOffset((VkDeviceSize)allocHandle - 1); - suballoc.userData = userData; -} - -void VmaBlockMetadata_Generic::DebugLogAllAllocations() const -{ - for (const auto& suballoc : m_Suballocations) - { - if (suballoc.type != VMA_SUBALLOCATION_TYPE_FREE) - DebugLogAllocation(suballoc.offset, suballoc.size, suballoc.userData); - } -} - -VmaSuballocationList::iterator VmaBlockMetadata_Generic::FindAtOffset(VkDeviceSize offset) const -{ - VMA_HEAVY_ASSERT(!m_Suballocations.empty()); - const VkDeviceSize last = m_Suballocations.rbegin()->offset; - if (last == offset) - return 
m_Suballocations.rbegin().drop_const(); - const VkDeviceSize first = m_Suballocations.begin()->offset; - if (first == offset) - return m_Suballocations.begin().drop_const(); - - const size_t suballocCount = m_Suballocations.size(); - const VkDeviceSize step = (last - first + m_Suballocations.begin()->size) / suballocCount; - auto findSuballocation = [&](auto begin, auto end) -> VmaSuballocationList::iterator - { - for (auto suballocItem = begin; - suballocItem != end; - ++suballocItem) - { - if (suballocItem->offset == offset) - return suballocItem.drop_const(); - } - VMA_ASSERT(false && "Not found!"); - return m_Suballocations.end().drop_const(); - }; - // If requested offset is closer to the end of range, search from the end - if (offset - first > suballocCount * step / 2) - { - return findSuballocation(m_Suballocations.rbegin(), m_Suballocations.rend()); - } - return findSuballocation(m_Suballocations.begin(), m_Suballocations.end()); -} - -bool VmaBlockMetadata_Generic::ValidateFreeSuballocationList() const -{ - VkDeviceSize lastSize = 0; - for (size_t i = 0, count = m_FreeSuballocationsBySize.size(); i < count; ++i) - { - const VmaSuballocationList::iterator it = m_FreeSuballocationsBySize[i]; - - VMA_VALIDATE(it->type == VMA_SUBALLOCATION_TYPE_FREE); - VMA_VALIDATE(it->size >= lastSize); - lastSize = it->size; - } - return true; -} - -bool VmaBlockMetadata_Generic::CheckAllocation( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - VmaSuballocationList::const_iterator suballocItem, - VmaAllocHandle* pAllocHandle) const -{ - VMA_ASSERT(allocSize > 0); - VMA_ASSERT(allocType != VMA_SUBALLOCATION_TYPE_FREE); - VMA_ASSERT(suballocItem != m_Suballocations.cend()); - VMA_ASSERT(pAllocHandle != VMA_NULL); - - const VkDeviceSize debugMargin = GetDebugMargin(); - const VkDeviceSize bufferImageGranularity = GetBufferImageGranularity(); - - const VmaSuballocation& suballoc = *suballocItem; - VMA_ASSERT(suballoc.type == VMA_SUBALLOCATION_TYPE_FREE); - - // Size of this suballocation is too small for this request: Early return. - if (suballoc.size < allocSize) - { - return false; - } - - // Start from offset equal to beginning of this suballocation. - VkDeviceSize offset = suballoc.offset + (suballocItem == m_Suballocations.cbegin() ? 0 : GetDebugMargin()); - - // Apply debugMargin from the end of previous alloc. - if (debugMargin > 0) - { - offset += debugMargin; - } - - // Apply alignment. - offset = VmaAlignUp(offset, allocAlignment); - - // Check previous suballocations for BufferImageGranularity conflicts. - // Make bigger alignment if necessary. - if (bufferImageGranularity > 1 && bufferImageGranularity != allocAlignment) - { - bool bufferImageGranularityConflict = false; - VmaSuballocationList::const_iterator prevSuballocItem = suballocItem; - while (prevSuballocItem != m_Suballocations.cbegin()) - { - --prevSuballocItem; - const VmaSuballocation& prevSuballoc = *prevSuballocItem; - if (VmaBlocksOnSamePage(prevSuballoc.offset, prevSuballoc.size, offset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(prevSuballoc.type, allocType)) - { - bufferImageGranularityConflict = true; - break; - } - } - else - // Already on previous page. - break; - } - if (bufferImageGranularityConflict) - { - offset = VmaAlignUp(offset, bufferImageGranularity); - } - } - - // Calculate padding at the beginning based on current offset. 
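// At this point 'offset' has been advanced past the debug margin, rounded up
// for alignment and possibly bumped further to avoid a buffer-image
// granularity conflict, so its distance from suballoc.offset is the unused
// gap at the front of this free suballocation.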
- const VkDeviceSize paddingBegin = offset - suballoc.offset; - - // Fail if requested size plus margin after is bigger than size of this suballocation. - if (paddingBegin + allocSize + debugMargin > suballoc.size) - { - return false; - } - - // Check next suballocations for BufferImageGranularity conflicts. - // If conflict exists, allocation cannot be made here. - if (allocSize % bufferImageGranularity || offset % bufferImageGranularity) - { - VmaSuballocationList::const_iterator nextSuballocItem = suballocItem; - ++nextSuballocItem; - while (nextSuballocItem != m_Suballocations.cend()) - { - const VmaSuballocation& nextSuballoc = *nextSuballocItem; - if (VmaBlocksOnSamePage(offset, allocSize, nextSuballoc.offset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(allocType, nextSuballoc.type)) - { - return false; - } - } - else - { - // Already on next page. - break; - } - ++nextSuballocItem; - } - } - - *pAllocHandle = (VmaAllocHandle)(offset + 1); - // All tests passed: Success. pAllocHandle is already filled. - return true; -} - -void VmaBlockMetadata_Generic::MergeFreeWithNext(VmaSuballocationList::iterator item) -{ - VMA_ASSERT(item != m_Suballocations.end()); - VMA_ASSERT(item->type == VMA_SUBALLOCATION_TYPE_FREE); - - VmaSuballocationList::iterator nextItem = item; - ++nextItem; - VMA_ASSERT(nextItem != m_Suballocations.end()); - VMA_ASSERT(nextItem->type == VMA_SUBALLOCATION_TYPE_FREE); - - item->size += nextItem->size; - --m_FreeCount; - m_Suballocations.erase(nextItem); -} - -VmaSuballocationList::iterator VmaBlockMetadata_Generic::FreeSuballocation(VmaSuballocationList::iterator suballocItem) -{ - // Change this suballocation to be marked as free. - VmaSuballocation& suballoc = *suballocItem; - suballoc.type = VMA_SUBALLOCATION_TYPE_FREE; - suballoc.userData = VMA_NULL; - - // Update totals. - ++m_FreeCount; - m_SumFreeSize += suballoc.size; - - // Merge with previous and/or next suballocation if it's also free. - bool mergeWithNext = false; - bool mergeWithPrev = false; - - VmaSuballocationList::iterator nextItem = suballocItem; - ++nextItem; - if ((nextItem != m_Suballocations.end()) && (nextItem->type == VMA_SUBALLOCATION_TYPE_FREE)) - { - mergeWithNext = true; - } - - VmaSuballocationList::iterator prevItem = suballocItem; - if (suballocItem != m_Suballocations.begin()) - { - --prevItem; - if (prevItem->type == VMA_SUBALLOCATION_TYPE_FREE) - { - mergeWithPrev = true; - } - } - - if (mergeWithNext) - { - UnregisterFreeSuballocation(nextItem); - MergeFreeWithNext(suballocItem); - } - - if (mergeWithPrev) - { - UnregisterFreeSuballocation(prevItem); - MergeFreeWithNext(prevItem); - RegisterFreeSuballocation(prevItem); - return prevItem; - } - else - { - RegisterFreeSuballocation(suballocItem); - return suballocItem; - } -} - -void VmaBlockMetadata_Generic::RegisterFreeSuballocation(VmaSuballocationList::iterator item) -{ - VMA_ASSERT(item->type == VMA_SUBALLOCATION_TYPE_FREE); - VMA_ASSERT(item->size > 0); - - // You may want to enable this validation at the beginning or at the end of - // this function, depending on what do you want to check. 
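// Either way, the insertion below goes through VmaVectorInsertSorted(), which
// keeps m_FreeSuballocationsBySize ordered by size ascending; that ordering is
// what ValidateFreeSuballocationList() checks and what the binary searches in
// CreateAllocationRequest() and UnregisterFreeSuballocation() rely on.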
- VMA_HEAVY_ASSERT(ValidateFreeSuballocationList()); - - if (m_FreeSuballocationsBySize.empty()) - { - m_FreeSuballocationsBySize.push_back(item); - } - else - { - VmaVectorInsertSorted(m_FreeSuballocationsBySize, item); - } - - //VMA_HEAVY_ASSERT(ValidateFreeSuballocationList()); -} - -void VmaBlockMetadata_Generic::UnregisterFreeSuballocation(VmaSuballocationList::iterator item) -{ - VMA_ASSERT(item->type == VMA_SUBALLOCATION_TYPE_FREE); - VMA_ASSERT(item->size > 0); - - // You may want to enable this validation at the beginning or at the end of - // this function, depending on what do you want to check. - VMA_HEAVY_ASSERT(ValidateFreeSuballocationList()); - - VmaSuballocationList::iterator* const it = VmaBinaryFindFirstNotLess( - m_FreeSuballocationsBySize.data(), - m_FreeSuballocationsBySize.data() + m_FreeSuballocationsBySize.size(), - item, - VmaSuballocationItemSizeLess()); - for (size_t index = it - m_FreeSuballocationsBySize.data(); - index < m_FreeSuballocationsBySize.size(); - ++index) - { - if (m_FreeSuballocationsBySize[index] == item) - { - VmaVectorRemove(m_FreeSuballocationsBySize, index); - return; - } - VMA_ASSERT((m_FreeSuballocationsBySize[index]->size == item->size) && "Not found."); - } - VMA_ASSERT(0 && "Not found."); - - //VMA_HEAVY_ASSERT(ValidateFreeSuballocationList()); -} -#endif // _VMA_BLOCK_METADATA_GENERIC_FUNCTIONS -#endif // _VMA_BLOCK_METADATA_GENERIC -#endif // #if 0 - -#ifndef _VMA_BLOCK_METADATA_LINEAR -/* -Allocations and their references in internal data structure look like this: - -if(m_2ndVectorMode == SECOND_VECTOR_EMPTY): - - 0 +-------+ - | | - | | - | | - +-------+ - | Alloc | 1st[m_1stNullItemsBeginCount] - +-------+ - | Alloc | 1st[m_1stNullItemsBeginCount + 1] - +-------+ - | ... | - +-------+ - | Alloc | 1st[1st.size() - 1] - +-------+ - | | - | | - | | -GetSize() +-------+ - -if(m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER): - - 0 +-------+ - | Alloc | 2nd[0] - +-------+ - | Alloc | 2nd[1] - +-------+ - | ... | - +-------+ - | Alloc | 2nd[2nd.size() - 1] - +-------+ - | | - | | - | | - +-------+ - | Alloc | 1st[m_1stNullItemsBeginCount] - +-------+ - | Alloc | 1st[m_1stNullItemsBeginCount + 1] - +-------+ - | ... | - +-------+ - | Alloc | 1st[1st.size() - 1] - +-------+ - | | -GetSize() +-------+ - -if(m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK): - - 0 +-------+ - | | - | | - | | - +-------+ - | Alloc | 1st[m_1stNullItemsBeginCount] - +-------+ - | Alloc | 1st[m_1stNullItemsBeginCount + 1] - +-------+ - | ... | - +-------+ - | Alloc | 1st[1st.size() - 1] - +-------+ - | | - | | - | | - +-------+ - | Alloc | 2nd[2nd.size() - 1] - +-------+ - | ... 
| - +-------+ - | Alloc | 2nd[1] - +-------+ - | Alloc | 2nd[0] -GetSize() +-------+ - -*/ -class VmaBlockMetadata_Linear : public VmaBlockMetadata -{ - VMA_CLASS_NO_COPY(VmaBlockMetadata_Linear) -public: - VmaBlockMetadata_Linear(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual); - virtual ~VmaBlockMetadata_Linear() = default; - - VkDeviceSize GetSumFreeSize() const override { return m_SumFreeSize; } - bool IsEmpty() const override { return GetAllocationCount() == 0; } - VkDeviceSize GetAllocationOffset(VmaAllocHandle allocHandle) const override { return (VkDeviceSize)allocHandle - 1; }; - - void Init(VkDeviceSize size) override; - bool Validate() const override; - size_t GetAllocationCount() const override; - size_t GetFreeRegionsCount() const override; - - void AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const override; - void AddStatistics(VmaStatistics& inoutStats) const override; - -#if VMA_STATS_STRING_ENABLED - void PrintDetailedMap(class VmaJsonWriter& json) const override; -#endif - - bool CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) override; - - VkResult CheckCorruption(const void* pBlockData) override; - - void Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) override; - - void Free(VmaAllocHandle allocHandle) override; - void GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) override; - void* GetAllocationUserData(VmaAllocHandle allocHandle) const override; - VmaAllocHandle GetAllocationListBegin() const override; - VmaAllocHandle GetNextAllocation(VmaAllocHandle prevAlloc) const override; - VkDeviceSize GetNextFreeRegionSize(VmaAllocHandle alloc) const override; - void Clear() override; - void SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) override; - void DebugLogAllAllocations() const override; - -private: - /* - There are two suballocation vectors, used in ping-pong way. - The one with index m_1stVectorIndex is called 1st. - The one with index (m_1stVectorIndex ^ 1) is called 2nd. - 2nd can be non-empty only when 1st is not empty. - When 2nd is not empty, m_2ndVectorMode indicates its mode of operation. - */ - typedef VmaVector> SuballocationVectorType; - - enum SECOND_VECTOR_MODE - { - SECOND_VECTOR_EMPTY, - /* - Suballocations in 2nd vector are created later than the ones in 1st, but they - all have smaller offset. - */ - SECOND_VECTOR_RING_BUFFER, - /* - Suballocations in 2nd vector are upper side of double stack. - They all have offsets higher than those in 1st vector. - Top of this stack means smaller offsets, but higher indices in this vector. - */ - SECOND_VECTOR_DOUBLE_STACK, - }; - - VkDeviceSize m_SumFreeSize; - SuballocationVectorType m_Suballocations0, m_Suballocations1; - uint32_t m_1stVectorIndex; - SECOND_VECTOR_MODE m_2ndVectorMode; - // Number of items in 1st vector with hAllocation = null at the beginning. - size_t m_1stNullItemsBeginCount; - // Number of other items in 1st vector with hAllocation = null somewhere in the middle. - size_t m_1stNullItemsMiddleCount; - // Number of items in 2nd vector with hAllocation = null. - size_t m_2ndNullItemsCount; - - SuballocationVectorType& AccessSuballocations1st() { return m_1stVectorIndex ? 
m_Suballocations1 : m_Suballocations0; } - SuballocationVectorType& AccessSuballocations2nd() { return m_1stVectorIndex ? m_Suballocations0 : m_Suballocations1; } - const SuballocationVectorType& AccessSuballocations1st() const { return m_1stVectorIndex ? m_Suballocations1 : m_Suballocations0; } - const SuballocationVectorType& AccessSuballocations2nd() const { return m_1stVectorIndex ? m_Suballocations0 : m_Suballocations1; } - - VmaSuballocation& FindSuballocation(VkDeviceSize offset) const; - bool ShouldCompact1st() const; - void CleanupAfterFree(); - - bool CreateAllocationRequest_LowerAddress( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest); - bool CreateAllocationRequest_UpperAddress( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest); -}; - -#ifndef _VMA_BLOCK_METADATA_LINEAR_FUNCTIONS -VmaBlockMetadata_Linear::VmaBlockMetadata_Linear(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual) - : VmaBlockMetadata(pAllocationCallbacks, bufferImageGranularity, isVirtual), - m_SumFreeSize(0), - m_Suballocations0(VmaStlAllocator(pAllocationCallbacks)), - m_Suballocations1(VmaStlAllocator(pAllocationCallbacks)), - m_1stVectorIndex(0), - m_2ndVectorMode(SECOND_VECTOR_EMPTY), - m_1stNullItemsBeginCount(0), - m_1stNullItemsMiddleCount(0), - m_2ndNullItemsCount(0) {} - -void VmaBlockMetadata_Linear::Init(VkDeviceSize size) -{ - VmaBlockMetadata::Init(size); - m_SumFreeSize = size; -} - -bool VmaBlockMetadata_Linear::Validate() const -{ - const SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - const SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - - VMA_VALIDATE(suballocations2nd.empty() == (m_2ndVectorMode == SECOND_VECTOR_EMPTY)); - VMA_VALIDATE(!suballocations1st.empty() || - suballocations2nd.empty() || - m_2ndVectorMode != SECOND_VECTOR_RING_BUFFER); - - if (!suballocations1st.empty()) - { - // Null item at the beginning should be accounted into m_1stNullItemsBeginCount. - VMA_VALIDATE(suballocations1st[m_1stNullItemsBeginCount].type != VMA_SUBALLOCATION_TYPE_FREE); - // Null item at the end should be just pop_back(). - VMA_VALIDATE(suballocations1st.back().type != VMA_SUBALLOCATION_TYPE_FREE); - } - if (!suballocations2nd.empty()) - { - // Null item at the end should be just pop_back(). 
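// In other words, trailing free items are expected to be removed eagerly
// (presumably by CleanupAfterFree()) rather than left in the vectors, so the
// back() of a non-empty vector must always be a live allocation.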
- VMA_VALIDATE(suballocations2nd.back().type != VMA_SUBALLOCATION_TYPE_FREE); - } - - VMA_VALIDATE(m_1stNullItemsBeginCount + m_1stNullItemsMiddleCount <= suballocations1st.size()); - VMA_VALIDATE(m_2ndNullItemsCount <= suballocations2nd.size()); - - VkDeviceSize sumUsedSize = 0; - const size_t suballoc1stCount = suballocations1st.size(); - const VkDeviceSize debugMargin = GetDebugMargin(); - VkDeviceSize offset = 0; - - if (m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - const size_t suballoc2ndCount = suballocations2nd.size(); - size_t nullItem2ndCount = 0; - for (size_t i = 0; i < suballoc2ndCount; ++i) - { - const VmaSuballocation& suballoc = suballocations2nd[i]; - const bool currFree = (suballoc.type == VMA_SUBALLOCATION_TYPE_FREE); - - VmaAllocation const alloc = (VmaAllocation)suballoc.userData; - if (!IsVirtual()) - { - VMA_VALIDATE(currFree == (alloc == VK_NULL_HANDLE)); - } - VMA_VALIDATE(suballoc.offset >= offset); - - if (!currFree) - { - if (!IsVirtual()) - { - VMA_VALIDATE((VkDeviceSize)alloc->GetAllocHandle() == suballoc.offset + 1); - VMA_VALIDATE(alloc->GetSize() == suballoc.size); - } - sumUsedSize += suballoc.size; - } - else - { - ++nullItem2ndCount; - } - - offset = suballoc.offset + suballoc.size + debugMargin; - } - - VMA_VALIDATE(nullItem2ndCount == m_2ndNullItemsCount); - } - - for (size_t i = 0; i < m_1stNullItemsBeginCount; ++i) - { - const VmaSuballocation& suballoc = suballocations1st[i]; - VMA_VALIDATE(suballoc.type == VMA_SUBALLOCATION_TYPE_FREE && - suballoc.userData == VMA_NULL); - } - - size_t nullItem1stCount = m_1stNullItemsBeginCount; - - for (size_t i = m_1stNullItemsBeginCount; i < suballoc1stCount; ++i) - { - const VmaSuballocation& suballoc = suballocations1st[i]; - const bool currFree = (suballoc.type == VMA_SUBALLOCATION_TYPE_FREE); - - VmaAllocation const alloc = (VmaAllocation)suballoc.userData; - if (!IsVirtual()) - { - VMA_VALIDATE(currFree == (alloc == VK_NULL_HANDLE)); - } - VMA_VALIDATE(suballoc.offset >= offset); - VMA_VALIDATE(i >= m_1stNullItemsBeginCount || currFree); - - if (!currFree) - { - if (!IsVirtual()) - { - VMA_VALIDATE((VkDeviceSize)alloc->GetAllocHandle() == suballoc.offset + 1); - VMA_VALIDATE(alloc->GetSize() == suballoc.size); - } - sumUsedSize += suballoc.size; - } - else - { - ++nullItem1stCount; - } - - offset = suballoc.offset + suballoc.size + debugMargin; - } - VMA_VALIDATE(nullItem1stCount == m_1stNullItemsBeginCount + m_1stNullItemsMiddleCount); - - if (m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - const size_t suballoc2ndCount = suballocations2nd.size(); - size_t nullItem2ndCount = 0; - for (size_t i = suballoc2ndCount; i--; ) - { - const VmaSuballocation& suballoc = suballocations2nd[i]; - const bool currFree = (suballoc.type == VMA_SUBALLOCATION_TYPE_FREE); - - VmaAllocation const alloc = (VmaAllocation)suballoc.userData; - if (!IsVirtual()) - { - VMA_VALIDATE(currFree == (alloc == VK_NULL_HANDLE)); - } - VMA_VALIDATE(suballoc.offset >= offset); - - if (!currFree) - { - if (!IsVirtual()) - { - VMA_VALIDATE((VkDeviceSize)alloc->GetAllocHandle() == suballoc.offset + 1); - VMA_VALIDATE(alloc->GetSize() == suballoc.size); - } - sumUsedSize += suballoc.size; - } - else - { - ++nullItem2ndCount; - } - - offset = suballoc.offset + suballoc.size + debugMargin; - } - - VMA_VALIDATE(nullItem2ndCount == m_2ndNullItemsCount); - } - - VMA_VALIDATE(offset <= GetSize()); - VMA_VALIDATE(m_SumFreeSize == GetSize() - sumUsedSize); - - return true; -} - -size_t VmaBlockMetadata_Linear::GetAllocationCount() const -{ - 
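// Live allocations are whatever remains after discounting the null (freed)
// items still physically present in the vectors:
//   1st.size() - leading nulls - middle nulls + 2nd.size() - nulls in 2nd.
// E.g. (hypothetical numbers) 1st.size() == 10 with 2 leading and 1 middle
// null, plus 2nd.size() == 4 with 1 null, gives 7 + 3 = 10 allocations.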
return AccessSuballocations1st().size() - m_1stNullItemsBeginCount - m_1stNullItemsMiddleCount + - AccessSuballocations2nd().size() - m_2ndNullItemsCount; -} - -size_t VmaBlockMetadata_Linear::GetFreeRegionsCount() const -{ - // Function only used for defragmentation, which is disabled for this algorithm - VMA_ASSERT(0); - return SIZE_MAX; -} - -void VmaBlockMetadata_Linear::AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const -{ - const VkDeviceSize size = GetSize(); - const SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - const SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - const size_t suballoc1stCount = suballocations1st.size(); - const size_t suballoc2ndCount = suballocations2nd.size(); - - inoutStats.statistics.blockCount++; - inoutStats.statistics.blockBytes += size; - - VkDeviceSize lastOffset = 0; - - if (m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - const VkDeviceSize freeSpace2ndTo1stEnd = suballocations1st[m_1stNullItemsBeginCount].offset; - size_t nextAlloc2ndIndex = 0; - while (lastOffset < freeSpace2ndTo1stEnd) - { - // Find next non-null allocation or move nextAllocIndex to the end. - while (nextAlloc2ndIndex < suballoc2ndCount && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - ++nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex < suballoc2ndCount) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - VmaAddDetailedStatisticsUnusedRange(inoutStats, unusedRangeSize); - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - VmaAddDetailedStatisticsAllocation(inoutStats, suballoc.size); - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc2ndIndex; - } - // We are at the end. - else - { - // There is free space from lastOffset to freeSpace2ndTo1stEnd. - if (lastOffset < freeSpace2ndTo1stEnd) - { - const VkDeviceSize unusedRangeSize = freeSpace2ndTo1stEnd - lastOffset; - VmaAddDetailedStatisticsUnusedRange(inoutStats, unusedRangeSize); - } - - // End of loop. - lastOffset = freeSpace2ndTo1stEnd; - } - } - } - - size_t nextAlloc1stIndex = m_1stNullItemsBeginCount; - const VkDeviceSize freeSpace1stTo2ndEnd = - m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK ? suballocations2nd.back().offset : size; - while (lastOffset < freeSpace1stTo2ndEnd) - { - // Find next non-null allocation or move nextAllocIndex to the end. - while (nextAlloc1stIndex < suballoc1stCount && - suballocations1st[nextAlloc1stIndex].userData == VMA_NULL) - { - ++nextAlloc1stIndex; - } - - // Found non-null allocation. - if (nextAlloc1stIndex < suballoc1stCount) - { - const VmaSuballocation& suballoc = suballocations1st[nextAlloc1stIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - VmaAddDetailedStatisticsUnusedRange(inoutStats, unusedRangeSize); - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - VmaAddDetailedStatisticsAllocation(inoutStats, suballoc.size); - - // 3. Prepare for next iteration. 
- lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc1stIndex; - } - // We are at the end. - else - { - // There is free space from lastOffset to freeSpace1stTo2ndEnd. - if (lastOffset < freeSpace1stTo2ndEnd) - { - const VkDeviceSize unusedRangeSize = freeSpace1stTo2ndEnd - lastOffset; - VmaAddDetailedStatisticsUnusedRange(inoutStats, unusedRangeSize); - } - - // End of loop. - lastOffset = freeSpace1stTo2ndEnd; - } - } - - if (m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - size_t nextAlloc2ndIndex = suballocations2nd.size() - 1; - while (lastOffset < size) - { - // Find next non-null allocation or move nextAllocIndex to the end. - while (nextAlloc2ndIndex != SIZE_MAX && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - --nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex != SIZE_MAX) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - VmaAddDetailedStatisticsUnusedRange(inoutStats, unusedRangeSize); - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - VmaAddDetailedStatisticsAllocation(inoutStats, suballoc.size); - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - --nextAlloc2ndIndex; - } - // We are at the end. - else - { - // There is free space from lastOffset to size. - if (lastOffset < size) - { - const VkDeviceSize unusedRangeSize = size - lastOffset; - VmaAddDetailedStatisticsUnusedRange(inoutStats, unusedRangeSize); - } - - // End of loop. - lastOffset = size; - } - } - } -} - -void VmaBlockMetadata_Linear::AddStatistics(VmaStatistics& inoutStats) const -{ - const SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - const SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - const VkDeviceSize size = GetSize(); - const size_t suballoc1stCount = suballocations1st.size(); - const size_t suballoc2ndCount = suballocations2nd.size(); - - inoutStats.blockCount++; - inoutStats.blockBytes += size; - inoutStats.allocationBytes += size - m_SumFreeSize; - - VkDeviceSize lastOffset = 0; - - if (m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - const VkDeviceSize freeSpace2ndTo1stEnd = suballocations1st[m_1stNullItemsBeginCount].offset; - size_t nextAlloc2ndIndex = m_1stNullItemsBeginCount; - while (lastOffset < freeSpace2ndTo1stEnd) - { - // Find next non-null allocation or move nextAlloc2ndIndex to the end. - while (nextAlloc2ndIndex < suballoc2ndCount && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - ++nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex < suballoc2ndCount) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - ++inoutStats.allocationCount; - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc2ndIndex; - } - // We are at the end. 
- else - { - if (lastOffset < freeSpace2ndTo1stEnd) - { - // There is free space from lastOffset to freeSpace2ndTo1stEnd. - const VkDeviceSize unusedRangeSize = freeSpace2ndTo1stEnd - lastOffset; - } - - // End of loop. - lastOffset = freeSpace2ndTo1stEnd; - } - } - } - - size_t nextAlloc1stIndex = m_1stNullItemsBeginCount; - const VkDeviceSize freeSpace1stTo2ndEnd = - m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK ? suballocations2nd.back().offset : size; - while (lastOffset < freeSpace1stTo2ndEnd) - { - // Find next non-null allocation or move nextAllocIndex to the end. - while (nextAlloc1stIndex < suballoc1stCount && - suballocations1st[nextAlloc1stIndex].userData == VMA_NULL) - { - ++nextAlloc1stIndex; - } - - // Found non-null allocation. - if (nextAlloc1stIndex < suballoc1stCount) - { - const VmaSuballocation& suballoc = suballocations1st[nextAlloc1stIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - ++inoutStats.allocationCount; - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc1stIndex; - } - // We are at the end. - else - { - if (lastOffset < freeSpace1stTo2ndEnd) - { - // There is free space from lastOffset to freeSpace1stTo2ndEnd. - const VkDeviceSize unusedRangeSize = freeSpace1stTo2ndEnd - lastOffset; - } - - // End of loop. - lastOffset = freeSpace1stTo2ndEnd; - } - } - - if (m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - size_t nextAlloc2ndIndex = suballocations2nd.size() - 1; - while (lastOffset < size) - { - // Find next non-null allocation or move nextAlloc2ndIndex to the end. - while (nextAlloc2ndIndex != SIZE_MAX && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - --nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex != SIZE_MAX) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - ++inoutStats.allocationCount; - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - --nextAlloc2ndIndex; - } - // We are at the end. - else - { - if (lastOffset < size) - { - // There is free space from lastOffset to size. - const VkDeviceSize unusedRangeSize = size - lastOffset; - } - - // End of loop. 
- lastOffset = size; - } - } - } -} - -#if VMA_STATS_STRING_ENABLED -void VmaBlockMetadata_Linear::PrintDetailedMap(class VmaJsonWriter& json) const -{ - const VkDeviceSize size = GetSize(); - const SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - const SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - const size_t suballoc1stCount = suballocations1st.size(); - const size_t suballoc2ndCount = suballocations2nd.size(); - - // FIRST PASS - - size_t unusedRangeCount = 0; - VkDeviceSize usedBytes = 0; - - VkDeviceSize lastOffset = 0; - - size_t alloc2ndCount = 0; - if (m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - const VkDeviceSize freeSpace2ndTo1stEnd = suballocations1st[m_1stNullItemsBeginCount].offset; - size_t nextAlloc2ndIndex = 0; - while (lastOffset < freeSpace2ndTo1stEnd) - { - // Find next non-null allocation or move nextAlloc2ndIndex to the end. - while (nextAlloc2ndIndex < suballoc2ndCount && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - ++nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex < suballoc2ndCount) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - ++unusedRangeCount; - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - ++alloc2ndCount; - usedBytes += suballoc.size; - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc2ndIndex; - } - // We are at the end. - else - { - if (lastOffset < freeSpace2ndTo1stEnd) - { - // There is free space from lastOffset to freeSpace2ndTo1stEnd. - ++unusedRangeCount; - } - - // End of loop. - lastOffset = freeSpace2ndTo1stEnd; - } - } - } - - size_t nextAlloc1stIndex = m_1stNullItemsBeginCount; - size_t alloc1stCount = 0; - const VkDeviceSize freeSpace1stTo2ndEnd = - m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK ? suballocations2nd.back().offset : size; - while (lastOffset < freeSpace1stTo2ndEnd) - { - // Find next non-null allocation or move nextAllocIndex to the end. - while (nextAlloc1stIndex < suballoc1stCount && - suballocations1st[nextAlloc1stIndex].userData == VMA_NULL) - { - ++nextAlloc1stIndex; - } - - // Found non-null allocation. - if (nextAlloc1stIndex < suballoc1stCount) - { - const VmaSuballocation& suballoc = suballocations1st[nextAlloc1stIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - ++unusedRangeCount; - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - ++alloc1stCount; - usedBytes += suballoc.size; - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc1stIndex; - } - // We are at the end. - else - { - if (lastOffset < size) - { - // There is free space from lastOffset to freeSpace1stTo2ndEnd. - ++unusedRangeCount; - } - - // End of loop. - lastOffset = freeSpace1stTo2ndEnd; - } - } - - if (m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - size_t nextAlloc2ndIndex = suballocations2nd.size() - 1; - while (lastOffset < size) - { - // Find next non-null allocation or move nextAlloc2ndIndex to the end. 
- while (nextAlloc2ndIndex != SIZE_MAX && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - --nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex != SIZE_MAX) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - ++unusedRangeCount; - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - ++alloc2ndCount; - usedBytes += suballoc.size; - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - --nextAlloc2ndIndex; - } - // We are at the end. - else - { - if (lastOffset < size) - { - // There is free space from lastOffset to size. - ++unusedRangeCount; - } - - // End of loop. - lastOffset = size; - } - } - } - - const VkDeviceSize unusedBytes = size - usedBytes; - PrintDetailedMap_Begin(json, unusedBytes, alloc1stCount + alloc2ndCount, unusedRangeCount); - - // SECOND PASS - lastOffset = 0; - - if (m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - const VkDeviceSize freeSpace2ndTo1stEnd = suballocations1st[m_1stNullItemsBeginCount].offset; - size_t nextAlloc2ndIndex = 0; - while (lastOffset < freeSpace2ndTo1stEnd) - { - // Find next non-null allocation or move nextAlloc2ndIndex to the end. - while (nextAlloc2ndIndex < suballoc2ndCount && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - ++nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex < suballoc2ndCount) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - PrintDetailedMap_UnusedRange(json, lastOffset, unusedRangeSize); - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - PrintDetailedMap_Allocation(json, suballoc.offset, suballoc.size, suballoc.userData); - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc2ndIndex; - } - // We are at the end. - else - { - if (lastOffset < freeSpace2ndTo1stEnd) - { - // There is free space from lastOffset to freeSpace2ndTo1stEnd. - const VkDeviceSize unusedRangeSize = freeSpace2ndTo1stEnd - lastOffset; - PrintDetailedMap_UnusedRange(json, lastOffset, unusedRangeSize); - } - - // End of loop. - lastOffset = freeSpace2ndTo1stEnd; - } - } - } - - nextAlloc1stIndex = m_1stNullItemsBeginCount; - while (lastOffset < freeSpace1stTo2ndEnd) - { - // Find next non-null allocation or move nextAllocIndex to the end. - while (nextAlloc1stIndex < suballoc1stCount && - suballocations1st[nextAlloc1stIndex].userData == VMA_NULL) - { - ++nextAlloc1stIndex; - } - - // Found non-null allocation. - if (nextAlloc1stIndex < suballoc1stCount) - { - const VmaSuballocation& suballoc = suballocations1st[nextAlloc1stIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - PrintDetailedMap_UnusedRange(json, lastOffset, unusedRangeSize); - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. 
- PrintDetailedMap_Allocation(json, suballoc.offset, suballoc.size, suballoc.userData); - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - ++nextAlloc1stIndex; - } - // We are at the end. - else - { - if (lastOffset < freeSpace1stTo2ndEnd) - { - // There is free space from lastOffset to freeSpace1stTo2ndEnd. - const VkDeviceSize unusedRangeSize = freeSpace1stTo2ndEnd - lastOffset; - PrintDetailedMap_UnusedRange(json, lastOffset, unusedRangeSize); - } - - // End of loop. - lastOffset = freeSpace1stTo2ndEnd; - } - } - - if (m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - size_t nextAlloc2ndIndex = suballocations2nd.size() - 1; - while (lastOffset < size) - { - // Find next non-null allocation or move nextAlloc2ndIndex to the end. - while (nextAlloc2ndIndex != SIZE_MAX && - suballocations2nd[nextAlloc2ndIndex].userData == VMA_NULL) - { - --nextAlloc2ndIndex; - } - - // Found non-null allocation. - if (nextAlloc2ndIndex != SIZE_MAX) - { - const VmaSuballocation& suballoc = suballocations2nd[nextAlloc2ndIndex]; - - // 1. Process free space before this allocation. - if (lastOffset < suballoc.offset) - { - // There is free space from lastOffset to suballoc.offset. - const VkDeviceSize unusedRangeSize = suballoc.offset - lastOffset; - PrintDetailedMap_UnusedRange(json, lastOffset, unusedRangeSize); - } - - // 2. Process this allocation. - // There is allocation with suballoc.offset, suballoc.size. - PrintDetailedMap_Allocation(json, suballoc.offset, suballoc.size, suballoc.userData); - - // 3. Prepare for next iteration. - lastOffset = suballoc.offset + suballoc.size; - --nextAlloc2ndIndex; - } - // We are at the end. - else - { - if (lastOffset < size) - { - // There is free space from lastOffset to size. - const VkDeviceSize unusedRangeSize = size - lastOffset; - PrintDetailedMap_UnusedRange(json, lastOffset, unusedRangeSize); - } - - // End of loop. - lastOffset = size; - } - } - } - - PrintDetailedMap_End(json); -} -#endif // VMA_STATS_STRING_ENABLED - -bool VmaBlockMetadata_Linear::CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) -{ - VMA_ASSERT(allocSize > 0); - VMA_ASSERT(allocType != VMA_SUBALLOCATION_TYPE_FREE); - VMA_ASSERT(pAllocationRequest != VMA_NULL); - VMA_HEAVY_ASSERT(Validate()); - pAllocationRequest->size = allocSize; - return upperAddress ? 
- CreateAllocationRequest_UpperAddress( - allocSize, allocAlignment, allocType, strategy, pAllocationRequest) : - CreateAllocationRequest_LowerAddress( - allocSize, allocAlignment, allocType, strategy, pAllocationRequest); -} - -VkResult VmaBlockMetadata_Linear::CheckCorruption(const void* pBlockData) -{ - VMA_ASSERT(!IsVirtual()); - SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - for (size_t i = m_1stNullItemsBeginCount, count = suballocations1st.size(); i < count; ++i) - { - const VmaSuballocation& suballoc = suballocations1st[i]; - if (suballoc.type != VMA_SUBALLOCATION_TYPE_FREE) - { - if (!VmaValidateMagicValue(pBlockData, suballoc.offset + suballoc.size)) - { - VMA_ASSERT(0 && "MEMORY CORRUPTION DETECTED AFTER VALIDATED ALLOCATION!"); - return VK_ERROR_UNKNOWN_COPY; - } - } - } - - SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - for (size_t i = 0, count = suballocations2nd.size(); i < count; ++i) - { - const VmaSuballocation& suballoc = suballocations2nd[i]; - if (suballoc.type != VMA_SUBALLOCATION_TYPE_FREE) - { - if (!VmaValidateMagicValue(pBlockData, suballoc.offset + suballoc.size)) - { - VMA_ASSERT(0 && "MEMORY CORRUPTION DETECTED AFTER VALIDATED ALLOCATION!"); - return VK_ERROR_UNKNOWN_COPY; - } - } - } - - return VK_SUCCESS; -} - -void VmaBlockMetadata_Linear::Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) -{ - const VkDeviceSize offset = (VkDeviceSize)request.allocHandle - 1; - const VmaSuballocation newSuballoc = { offset, request.size, userData, type }; - - switch (request.type) - { - case VmaAllocationRequestType::UpperAddress: - { - VMA_ASSERT(m_2ndVectorMode != SECOND_VECTOR_RING_BUFFER && - "CRITICAL ERROR: Trying to use linear allocator as double stack while it was already used as ring buffer."); - SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - suballocations2nd.push_back(newSuballoc); - m_2ndVectorMode = SECOND_VECTOR_DOUBLE_STACK; - } - break; - case VmaAllocationRequestType::EndOf1st: - { - SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - - VMA_ASSERT(suballocations1st.empty() || - offset >= suballocations1st.back().offset + suballocations1st.back().size); - // Check if it fits before the end of the block. - VMA_ASSERT(offset + request.size <= GetSize()); - - suballocations1st.push_back(newSuballoc); - } - break; - case VmaAllocationRequestType::EndOf2nd: - { - SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - // New allocation at the end of 2-part ring buffer, so before first allocation from 1st vector. - VMA_ASSERT(!suballocations1st.empty() && - offset + request.size <= suballocations1st[m_1stNullItemsBeginCount].offset); - SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - - switch (m_2ndVectorMode) - { - case SECOND_VECTOR_EMPTY: - // First allocation from second part ring buffer. - VMA_ASSERT(suballocations2nd.empty()); - m_2ndVectorMode = SECOND_VECTOR_RING_BUFFER; - break; - case SECOND_VECTOR_RING_BUFFER: - // 2-part ring buffer is already started. 
- VMA_ASSERT(!suballocations2nd.empty()); - break; - case SECOND_VECTOR_DOUBLE_STACK: - VMA_ASSERT(0 && "CRITICAL ERROR: Trying to use linear allocator as ring buffer while it was already used as double stack."); - break; - default: - VMA_ASSERT(0); - } - - suballocations2nd.push_back(newSuballoc); - } - break; - default: - VMA_ASSERT(0 && "CRITICAL INTERNAL ERROR."); - } - - m_SumFreeSize -= newSuballoc.size; -} - -void VmaBlockMetadata_Linear::Free(VmaAllocHandle allocHandle) -{ - SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - VkDeviceSize offset = (VkDeviceSize)allocHandle - 1; - - if (!suballocations1st.empty()) - { - // First allocation: Mark it as next empty at the beginning. - VmaSuballocation& firstSuballoc = suballocations1st[m_1stNullItemsBeginCount]; - if (firstSuballoc.offset == offset) - { - firstSuballoc.type = VMA_SUBALLOCATION_TYPE_FREE; - firstSuballoc.userData = VMA_NULL; - m_SumFreeSize += firstSuballoc.size; - ++m_1stNullItemsBeginCount; - CleanupAfterFree(); - return; - } - } - - // Last allocation in 2-part ring buffer or top of upper stack (same logic). - if (m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER || - m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - VmaSuballocation& lastSuballoc = suballocations2nd.back(); - if (lastSuballoc.offset == offset) - { - m_SumFreeSize += lastSuballoc.size; - suballocations2nd.pop_back(); - CleanupAfterFree(); - return; - } - } - // Last allocation in 1st vector. - else if (m_2ndVectorMode == SECOND_VECTOR_EMPTY) - { - VmaSuballocation& lastSuballoc = suballocations1st.back(); - if (lastSuballoc.offset == offset) - { - m_SumFreeSize += lastSuballoc.size; - suballocations1st.pop_back(); - CleanupAfterFree(); - return; - } - } - - VmaSuballocation refSuballoc; - refSuballoc.offset = offset; - // Rest of members stays uninitialized intentionally for better performance. - - // Item from the middle of 1st vector. - { - const SuballocationVectorType::iterator it = VmaBinaryFindSorted( - suballocations1st.begin() + m_1stNullItemsBeginCount, - suballocations1st.end(), - refSuballoc, - VmaSuballocationOffsetLess()); - if (it != suballocations1st.end()) - { - it->type = VMA_SUBALLOCATION_TYPE_FREE; - it->userData = VMA_NULL; - ++m_1stNullItemsMiddleCount; - m_SumFreeSize += it->size; - CleanupAfterFree(); - return; - } - } - - if (m_2ndVectorMode != SECOND_VECTOR_EMPTY) - { - // Item from the middle of 2nd vector. - const SuballocationVectorType::iterator it = m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER ? 
- VmaBinaryFindSorted(suballocations2nd.begin(), suballocations2nd.end(), refSuballoc, VmaSuballocationOffsetLess()) : - VmaBinaryFindSorted(suballocations2nd.begin(), suballocations2nd.end(), refSuballoc, VmaSuballocationOffsetGreater()); - if (it != suballocations2nd.end()) - { - it->type = VMA_SUBALLOCATION_TYPE_FREE; - it->userData = VMA_NULL; - ++m_2ndNullItemsCount; - m_SumFreeSize += it->size; - CleanupAfterFree(); - return; - } - } - - VMA_ASSERT(0 && "Allocation to free not found in linear allocator!"); -} - -void VmaBlockMetadata_Linear::GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) -{ - outInfo.offset = (VkDeviceSize)allocHandle - 1; - VmaSuballocation& suballoc = FindSuballocation(outInfo.offset); - outInfo.size = suballoc.size; - outInfo.pUserData = suballoc.userData; -} - -void* VmaBlockMetadata_Linear::GetAllocationUserData(VmaAllocHandle allocHandle) const -{ - return FindSuballocation((VkDeviceSize)allocHandle - 1).userData; -} - -VmaAllocHandle VmaBlockMetadata_Linear::GetAllocationListBegin() const -{ - // Function only used for defragmentation, which is disabled for this algorithm - VMA_ASSERT(0); - return VK_NULL_HANDLE; -} - -VmaAllocHandle VmaBlockMetadata_Linear::GetNextAllocation(VmaAllocHandle prevAlloc) const -{ - // Function only used for defragmentation, which is disabled for this algorithm - VMA_ASSERT(0); - return VK_NULL_HANDLE; -} - -VkDeviceSize VmaBlockMetadata_Linear::GetNextFreeRegionSize(VmaAllocHandle alloc) const -{ - // Function only used for defragmentation, which is disabled for this algorithm - VMA_ASSERT(0); - return 0; -} - -void VmaBlockMetadata_Linear::Clear() -{ - m_SumFreeSize = GetSize(); - m_Suballocations0.clear(); - m_Suballocations1.clear(); - // Leaving m_1stVectorIndex unchanged - it doesn't matter. - m_2ndVectorMode = SECOND_VECTOR_EMPTY; - m_1stNullItemsBeginCount = 0; - m_1stNullItemsMiddleCount = 0; - m_2ndNullItemsCount = 0; -} - -void VmaBlockMetadata_Linear::SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) -{ - VmaSuballocation& suballoc = FindSuballocation((VkDeviceSize)allocHandle - 1); - suballoc.userData = userData; -} - -void VmaBlockMetadata_Linear::DebugLogAllAllocations() const -{ - const SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - for (auto it = suballocations1st.begin() + m_1stNullItemsBeginCount; it != suballocations1st.end(); ++it) - if (it->type != VMA_SUBALLOCATION_TYPE_FREE) - DebugLogAllocation(it->offset, it->size, it->userData); - - const SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - for (auto it = suballocations2nd.begin(); it != suballocations2nd.end(); ++it) - if (it->type != VMA_SUBALLOCATION_TYPE_FREE) - DebugLogAllocation(it->offset, it->size, it->userData); -} - -VmaSuballocation& VmaBlockMetadata_Linear::FindSuballocation(VkDeviceSize offset) const -{ - const SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - const SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - - VmaSuballocation refSuballoc; - refSuballoc.offset = offset; - // Rest of members stays uninitialized intentionally for better performance. - - // Item from the 1st vector. 
- { - SuballocationVectorType::const_iterator it = VmaBinaryFindSorted( - suballocations1st.begin() + m_1stNullItemsBeginCount, - suballocations1st.end(), - refSuballoc, - VmaSuballocationOffsetLess()); - if (it != suballocations1st.end()) - { - return const_cast(*it); - } - } - - if (m_2ndVectorMode != SECOND_VECTOR_EMPTY) - { - // Rest of members stays uninitialized intentionally for better performance. - SuballocationVectorType::const_iterator it = m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER ? - VmaBinaryFindSorted(suballocations2nd.begin(), suballocations2nd.end(), refSuballoc, VmaSuballocationOffsetLess()) : - VmaBinaryFindSorted(suballocations2nd.begin(), suballocations2nd.end(), refSuballoc, VmaSuballocationOffsetGreater()); - if (it != suballocations2nd.end()) - { - return const_cast(*it); - } - } - - VMA_ASSERT(0 && "Allocation not found in linear allocator!"); - return const_cast(suballocations1st.back()); // Should never occur. -} - -bool VmaBlockMetadata_Linear::ShouldCompact1st() const -{ - const size_t nullItemCount = m_1stNullItemsBeginCount + m_1stNullItemsMiddleCount; - const size_t suballocCount = AccessSuballocations1st().size(); - return suballocCount > 32 && nullItemCount * 2 >= (suballocCount - nullItemCount) * 3; -} - -void VmaBlockMetadata_Linear::CleanupAfterFree() -{ - SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - - if (IsEmpty()) - { - suballocations1st.clear(); - suballocations2nd.clear(); - m_1stNullItemsBeginCount = 0; - m_1stNullItemsMiddleCount = 0; - m_2ndNullItemsCount = 0; - m_2ndVectorMode = SECOND_VECTOR_EMPTY; - } - else - { - const size_t suballoc1stCount = suballocations1st.size(); - const size_t nullItem1stCount = m_1stNullItemsBeginCount + m_1stNullItemsMiddleCount; - VMA_ASSERT(nullItem1stCount <= suballoc1stCount); - - // Find more null items at the beginning of 1st vector. - while (m_1stNullItemsBeginCount < suballoc1stCount && - suballocations1st[m_1stNullItemsBeginCount].type == VMA_SUBALLOCATION_TYPE_FREE) - { - ++m_1stNullItemsBeginCount; - --m_1stNullItemsMiddleCount; - } - - // Find more null items at the end of 1st vector. - while (m_1stNullItemsMiddleCount > 0 && - suballocations1st.back().type == VMA_SUBALLOCATION_TYPE_FREE) - { - --m_1stNullItemsMiddleCount; - suballocations1st.pop_back(); - } - - // Find more null items at the end of 2nd vector. - while (m_2ndNullItemsCount > 0 && - suballocations2nd.back().type == VMA_SUBALLOCATION_TYPE_FREE) - { - --m_2ndNullItemsCount; - suballocations2nd.pop_back(); - } - - // Find more null items at the beginning of 2nd vector. - while (m_2ndNullItemsCount > 0 && - suballocations2nd[0].type == VMA_SUBALLOCATION_TYPE_FREE) - { - --m_2ndNullItemsCount; - VmaVectorRemove(suballocations2nd, 0); - } - - if (ShouldCompact1st()) - { - const size_t nonNullItemCount = suballoc1stCount - nullItem1stCount; - size_t srcIndex = m_1stNullItemsBeginCount; - for (size_t dstIndex = 0; dstIndex < nonNullItemCount; ++dstIndex) - { - while (suballocations1st[srcIndex].type == VMA_SUBALLOCATION_TYPE_FREE) - { - ++srcIndex; - } - if (dstIndex != srcIndex) - { - suballocations1st[dstIndex] = suballocations1st[srcIndex]; - } - ++srcIndex; - } - suballocations1st.resize(nonNullItemCount); - m_1stNullItemsBeginCount = 0; - m_1stNullItemsMiddleCount = 0; - } - - // 2nd vector became empty. - if (suballocations2nd.empty()) - { - m_2ndVectorMode = SECOND_VECTOR_EMPTY; - } - - // 1st vector became empty. 
- if (suballocations1st.size() - m_1stNullItemsBeginCount == 0) - { - suballocations1st.clear(); - m_1stNullItemsBeginCount = 0; - - if (!suballocations2nd.empty() && m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - // Swap 1st with 2nd. Now 2nd is empty. - m_2ndVectorMode = SECOND_VECTOR_EMPTY; - m_1stNullItemsMiddleCount = m_2ndNullItemsCount; - while (m_1stNullItemsBeginCount < suballocations2nd.size() && - suballocations2nd[m_1stNullItemsBeginCount].type == VMA_SUBALLOCATION_TYPE_FREE) - { - ++m_1stNullItemsBeginCount; - --m_1stNullItemsMiddleCount; - } - m_2ndNullItemsCount = 0; - m_1stVectorIndex ^= 1; - } - } - } - - VMA_HEAVY_ASSERT(Validate()); -} - -bool VmaBlockMetadata_Linear::CreateAllocationRequest_LowerAddress( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) -{ - const VkDeviceSize blockSize = GetSize(); - const VkDeviceSize debugMargin = GetDebugMargin(); - const VkDeviceSize bufferImageGranularity = GetBufferImageGranularity(); - SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - - if (m_2ndVectorMode == SECOND_VECTOR_EMPTY || m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - // Try to allocate at the end of 1st vector. - - VkDeviceSize resultBaseOffset = 0; - if (!suballocations1st.empty()) - { - const VmaSuballocation& lastSuballoc = suballocations1st.back(); - resultBaseOffset = lastSuballoc.offset + lastSuballoc.size + debugMargin; - } - - // Start from offset equal to beginning of free space. - VkDeviceSize resultOffset = resultBaseOffset; - - // Apply alignment. - resultOffset = VmaAlignUp(resultOffset, allocAlignment); - - // Check previous suballocations for BufferImageGranularity conflicts. - // Make bigger alignment if necessary. - if (bufferImageGranularity > 1 && bufferImageGranularity != allocAlignment && !suballocations1st.empty()) - { - bool bufferImageGranularityConflict = false; - for (size_t prevSuballocIndex = suballocations1st.size(); prevSuballocIndex--; ) - { - const VmaSuballocation& prevSuballoc = suballocations1st[prevSuballocIndex]; - if (VmaBlocksOnSamePage(prevSuballoc.offset, prevSuballoc.size, resultOffset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(prevSuballoc.type, allocType)) - { - bufferImageGranularityConflict = true; - break; - } - } - else - // Already on previous page. - break; - } - if (bufferImageGranularityConflict) - { - resultOffset = VmaAlignUp(resultOffset, bufferImageGranularity); - } - } - - const VkDeviceSize freeSpaceEnd = m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK ? - suballocations2nd.back().offset : blockSize; - - // There is enough free space at the end after alignment. - if (resultOffset + allocSize + debugMargin <= freeSpaceEnd) - { - // Check next suballocations for BufferImageGranularity conflicts. - // If conflict exists, allocation cannot be made here. 
- if ((allocSize % bufferImageGranularity || resultOffset % bufferImageGranularity) && m_2ndVectorMode == SECOND_VECTOR_DOUBLE_STACK) - { - for (size_t nextSuballocIndex = suballocations2nd.size(); nextSuballocIndex--; ) - { - const VmaSuballocation& nextSuballoc = suballocations2nd[nextSuballocIndex]; - if (VmaBlocksOnSamePage(resultOffset, allocSize, nextSuballoc.offset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(allocType, nextSuballoc.type)) - { - return false; - } - } - else - { - // Already on previous page. - break; - } - } - } - - // All tests passed: Success. - pAllocationRequest->allocHandle = (VmaAllocHandle)(resultOffset + 1); - // pAllocationRequest->item, customData unused. - pAllocationRequest->type = VmaAllocationRequestType::EndOf1st; - return true; - } - } - - // Wrap-around to end of 2nd vector. Try to allocate there, watching for the - // beginning of 1st vector as the end of free space. - if (m_2ndVectorMode == SECOND_VECTOR_EMPTY || m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - VMA_ASSERT(!suballocations1st.empty()); - - VkDeviceSize resultBaseOffset = 0; - if (!suballocations2nd.empty()) - { - const VmaSuballocation& lastSuballoc = suballocations2nd.back(); - resultBaseOffset = lastSuballoc.offset + lastSuballoc.size + debugMargin; - } - - // Start from offset equal to beginning of free space. - VkDeviceSize resultOffset = resultBaseOffset; - - // Apply alignment. - resultOffset = VmaAlignUp(resultOffset, allocAlignment); - - // Check previous suballocations for BufferImageGranularity conflicts. - // Make bigger alignment if necessary. - if (bufferImageGranularity > 1 && bufferImageGranularity != allocAlignment && !suballocations2nd.empty()) - { - bool bufferImageGranularityConflict = false; - for (size_t prevSuballocIndex = suballocations2nd.size(); prevSuballocIndex--; ) - { - const VmaSuballocation& prevSuballoc = suballocations2nd[prevSuballocIndex]; - if (VmaBlocksOnSamePage(prevSuballoc.offset, prevSuballoc.size, resultOffset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(prevSuballoc.type, allocType)) - { - bufferImageGranularityConflict = true; - break; - } - } - else - // Already on previous page. - break; - } - if (bufferImageGranularityConflict) - { - resultOffset = VmaAlignUp(resultOffset, bufferImageGranularity); - } - } - - size_t index1st = m_1stNullItemsBeginCount; - - // There is enough free space at the end after alignment. - if ((index1st == suballocations1st.size() && resultOffset + allocSize + debugMargin <= blockSize) || - (index1st < suballocations1st.size() && resultOffset + allocSize + debugMargin <= suballocations1st[index1st].offset)) - { - // Check next suballocations for BufferImageGranularity conflicts. - // If conflict exists, allocation cannot be made here. - if (allocSize % bufferImageGranularity || resultOffset % bufferImageGranularity) - { - for (size_t nextSuballocIndex = index1st; - nextSuballocIndex < suballocations1st.size(); - nextSuballocIndex++) - { - const VmaSuballocation& nextSuballoc = suballocations1st[nextSuballocIndex]; - if (VmaBlocksOnSamePage(resultOffset, allocSize, nextSuballoc.offset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(allocType, nextSuballoc.type)) - { - return false; - } - } - else - { - // Already on next page. - break; - } - } - } - - // All tests passed: Success. 
- pAllocationRequest->allocHandle = (VmaAllocHandle)(resultOffset + 1); - pAllocationRequest->type = VmaAllocationRequestType::EndOf2nd; - // pAllocationRequest->item, customData unused. - return true; - } - } - - return false; -} - -bool VmaBlockMetadata_Linear::CreateAllocationRequest_UpperAddress( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) -{ - const VkDeviceSize blockSize = GetSize(); - const VkDeviceSize bufferImageGranularity = GetBufferImageGranularity(); - SuballocationVectorType& suballocations1st = AccessSuballocations1st(); - SuballocationVectorType& suballocations2nd = AccessSuballocations2nd(); - - if (m_2ndVectorMode == SECOND_VECTOR_RING_BUFFER) - { - VMA_ASSERT(0 && "Trying to use pool with linear algorithm as double stack, while it is already being used as ring buffer."); - return false; - } - - // Try to allocate before 2nd.back(), or end of block if 2nd.empty(). - if (allocSize > blockSize) - { - return false; - } - VkDeviceSize resultBaseOffset = blockSize - allocSize; - if (!suballocations2nd.empty()) - { - const VmaSuballocation& lastSuballoc = suballocations2nd.back(); - resultBaseOffset = lastSuballoc.offset - allocSize; - if (allocSize > lastSuballoc.offset) - { - return false; - } - } - - // Start from offset equal to end of free space. - VkDeviceSize resultOffset = resultBaseOffset; - - const VkDeviceSize debugMargin = GetDebugMargin(); - - // Apply debugMargin at the end. - if (debugMargin > 0) - { - if (resultOffset < debugMargin) - { - return false; - } - resultOffset -= debugMargin; - } - - // Apply alignment. - resultOffset = VmaAlignDown(resultOffset, allocAlignment); - - // Check next suballocations from 2nd for BufferImageGranularity conflicts. - // Make bigger alignment if necessary. - if (bufferImageGranularity > 1 && bufferImageGranularity != allocAlignment && !suballocations2nd.empty()) - { - bool bufferImageGranularityConflict = false; - for (size_t nextSuballocIndex = suballocations2nd.size(); nextSuballocIndex--; ) - { - const VmaSuballocation& nextSuballoc = suballocations2nd[nextSuballocIndex]; - if (VmaBlocksOnSamePage(resultOffset, allocSize, nextSuballoc.offset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(nextSuballoc.type, allocType)) - { - bufferImageGranularityConflict = true; - break; - } - } - else - // Already on previous page. - break; - } - if (bufferImageGranularityConflict) - { - resultOffset = VmaAlignDown(resultOffset, bufferImageGranularity); - } - } - - // There is enough free space. - const VkDeviceSize endOf1st = !suballocations1st.empty() ? - suballocations1st.back().offset + suballocations1st.back().size : - 0; - if (endOf1st + debugMargin <= resultOffset) - { - // Check previous suballocations for BufferImageGranularity conflicts. - // If conflict exists, allocation cannot be made here. - if (bufferImageGranularity > 1) - { - for (size_t prevSuballocIndex = suballocations1st.size(); prevSuballocIndex--; ) - { - const VmaSuballocation& prevSuballoc = suballocations1st[prevSuballocIndex]; - if (VmaBlocksOnSamePage(prevSuballoc.offset, prevSuballoc.size, resultOffset, bufferImageGranularity)) - { - if (VmaIsBufferImageGranularityConflict(allocType, prevSuballoc.type)) - { - return false; - } - } - else - { - // Already on next page. - break; - } - } - } - - // All tests passed: Success. 
- pAllocationRequest->allocHandle = (VmaAllocHandle)(resultOffset + 1); - // pAllocationRequest->item unused. - pAllocationRequest->type = VmaAllocationRequestType::UpperAddress; - return true; - } - - return false; -} -#endif // _VMA_BLOCK_METADATA_LINEAR_FUNCTIONS -#endif // _VMA_BLOCK_METADATA_LINEAR - -#if 0 -#ifndef _VMA_BLOCK_METADATA_BUDDY -/* -- GetSize() is the original size of allocated memory block. -- m_UsableSize is this size aligned down to a power of two. - All allocations and calculations happen relative to m_UsableSize. -- GetUnusableSize() is the difference between them. - It is reported as separate, unused range, not available for allocations. - -Node at level 0 has size = m_UsableSize. -Each next level contains nodes with size 2 times smaller than current level. -m_LevelCount is the maximum number of levels to use in the current object. -*/ -class VmaBlockMetadata_Buddy : public VmaBlockMetadata -{ - VMA_CLASS_NO_COPY(VmaBlockMetadata_Buddy) -public: - VmaBlockMetadata_Buddy(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual); - virtual ~VmaBlockMetadata_Buddy(); - - size_t GetAllocationCount() const override { return m_AllocationCount; } - VkDeviceSize GetSumFreeSize() const override { return m_SumFreeSize + GetUnusableSize(); } - bool IsEmpty() const override { return m_Root->type == Node::TYPE_FREE; } - VkResult CheckCorruption(const void* pBlockData) override { return VK_ERROR_FEATURE_NOT_PRESENT; } - VkDeviceSize GetAllocationOffset(VmaAllocHandle allocHandle) const override { return (VkDeviceSize)allocHandle - 1; }; - void DebugLogAllAllocations() const override { DebugLogAllAllocationNode(m_Root, 0); } - - void Init(VkDeviceSize size) override; - bool Validate() const override; - - void AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const override; - void AddStatistics(VmaStatistics& inoutStats) const override; - -#if VMA_STATS_STRING_ENABLED - void PrintDetailedMap(class VmaJsonWriter& json, uint32_t mapRefCount) const override; -#endif - - bool CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) override; - - void Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) override; - - void Free(VmaAllocHandle allocHandle) override; - void GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) override; - void* GetAllocationUserData(VmaAllocHandle allocHandle) const override; - VmaAllocHandle GetAllocationListBegin() const override; - VmaAllocHandle GetNextAllocation(VmaAllocHandle prevAlloc) const override; - void Clear() override; - void SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) override; - -private: - static const size_t MAX_LEVELS = 48; - - struct ValidationContext - { - size_t calculatedAllocationCount = 0; - size_t calculatedFreeCount = 0; - VkDeviceSize calculatedSumFreeSize = 0; - }; - struct Node - { - VkDeviceSize offset; - enum TYPE - { - TYPE_FREE, - TYPE_ALLOCATION, - TYPE_SPLIT, - TYPE_COUNT - } type; - Node* parent; - Node* buddy; - - union - { - struct - { - Node* prev; - Node* next; - } free; - struct - { - void* userData; - } allocation; - struct - { - Node* leftChild; - } split; - }; - }; - - // Size of the memory block aligned down to a power of two. 
- VkDeviceSize m_UsableSize; - uint32_t m_LevelCount; - VmaPoolAllocator m_NodeAllocator; - Node* m_Root; - struct - { - Node* front; - Node* back; - } m_FreeList[MAX_LEVELS]; - - // Number of nodes in the tree with type == TYPE_ALLOCATION. - size_t m_AllocationCount; - // Number of nodes in the tree with type == TYPE_FREE. - size_t m_FreeCount; - // Doesn't include space wasted due to internal fragmentation - allocation sizes are just aligned up to node sizes. - // Doesn't include unusable size. - VkDeviceSize m_SumFreeSize; - - VkDeviceSize GetUnusableSize() const { return GetSize() - m_UsableSize; } - VkDeviceSize LevelToNodeSize(uint32_t level) const { return m_UsableSize >> level; } - - VkDeviceSize AlignAllocationSize(VkDeviceSize size) const - { - if (!IsVirtual()) - { - size = VmaAlignUp(size, (VkDeviceSize)16); - } - return VmaNextPow2(size); - } - Node* FindAllocationNode(VkDeviceSize offset, uint32_t& outLevel) const; - void DeleteNodeChildren(Node* node); - bool ValidateNode(ValidationContext& ctx, const Node* parent, const Node* curr, uint32_t level, VkDeviceSize levelNodeSize) const; - uint32_t AllocSizeToLevel(VkDeviceSize allocSize) const; - void AddNodeToDetailedStatistics(VmaDetailedStatistics& inoutStats, const Node* node, VkDeviceSize levelNodeSize) const; - // Adds node to the front of FreeList at given level. - // node->type must be FREE. - // node->free.prev, next can be undefined. - void AddToFreeListFront(uint32_t level, Node* node); - // Removes node from FreeList at given level. - // node->type must be FREE. - // node->free.prev, next stay untouched. - void RemoveFromFreeList(uint32_t level, Node* node); - void DebugLogAllAllocationNode(Node* node, uint32_t level) const; - -#if VMA_STATS_STRING_ENABLED - void PrintDetailedMapNode(class VmaJsonWriter& json, const Node* node, VkDeviceSize levelNodeSize) const; -#endif -}; - -#ifndef _VMA_BLOCK_METADATA_BUDDY_FUNCTIONS -VmaBlockMetadata_Buddy::VmaBlockMetadata_Buddy(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual) - : VmaBlockMetadata(pAllocationCallbacks, bufferImageGranularity, isVirtual), - m_NodeAllocator(pAllocationCallbacks, 32), // firstBlockCapacity - m_Root(VMA_NULL), - m_AllocationCount(0), - m_FreeCount(1), - m_SumFreeSize(0) -{ - memset(m_FreeList, 0, sizeof(m_FreeList)); -} - -VmaBlockMetadata_Buddy::~VmaBlockMetadata_Buddy() -{ - DeleteNodeChildren(m_Root); - m_NodeAllocator.Free(m_Root); -} - -void VmaBlockMetadata_Buddy::Init(VkDeviceSize size) -{ - VmaBlockMetadata::Init(size); - - m_UsableSize = VmaPrevPow2(size); - m_SumFreeSize = m_UsableSize; - - // Calculate m_LevelCount. - const VkDeviceSize minNodeSize = IsVirtual() ? 1 : 16; - m_LevelCount = 1; - while (m_LevelCount < MAX_LEVELS && - LevelToNodeSize(m_LevelCount) >= minNodeSize) - { - ++m_LevelCount; - } - - Node* rootNode = m_NodeAllocator.Alloc(); - rootNode->offset = 0; - rootNode->type = Node::TYPE_FREE; - rootNode->parent = VMA_NULL; - rootNode->buddy = VMA_NULL; - - m_Root = rootNode; - AddToFreeListFront(0, rootNode); -} - -bool VmaBlockMetadata_Buddy::Validate() const -{ - // Validate tree. - ValidationContext ctx; - if (!ValidateNode(ctx, VMA_NULL, m_Root, 0, LevelToNodeSize(0))) - { - VMA_VALIDATE(false && "ValidateNode failed."); - } - VMA_VALIDATE(m_AllocationCount == ctx.calculatedAllocationCount); - VMA_VALIDATE(m_SumFreeSize == ctx.calculatedSumFreeSize); - - // Validate free node lists. 
- for (uint32_t level = 0; level < m_LevelCount; ++level) - { - VMA_VALIDATE(m_FreeList[level].front == VMA_NULL || - m_FreeList[level].front->free.prev == VMA_NULL); - - for (Node* node = m_FreeList[level].front; - node != VMA_NULL; - node = node->free.next) - { - VMA_VALIDATE(node->type == Node::TYPE_FREE); - - if (node->free.next == VMA_NULL) - { - VMA_VALIDATE(m_FreeList[level].back == node); - } - else - { - VMA_VALIDATE(node->free.next->free.prev == node); - } - } - } - - // Validate that free lists ar higher levels are empty. - for (uint32_t level = m_LevelCount; level < MAX_LEVELS; ++level) - { - VMA_VALIDATE(m_FreeList[level].front == VMA_NULL && m_FreeList[level].back == VMA_NULL); - } - - return true; -} - -void VmaBlockMetadata_Buddy::AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const -{ - inoutStats.statistics.blockCount++; - inoutStats.statistics.blockBytes += GetSize(); - - AddNodeToDetailedStatistics(inoutStats, m_Root, LevelToNodeSize(0)); - - const VkDeviceSize unusableSize = GetUnusableSize(); - if (unusableSize > 0) - VmaAddDetailedStatisticsUnusedRange(inoutStats, unusableSize); -} - -void VmaBlockMetadata_Buddy::AddStatistics(VmaStatistics& inoutStats) const -{ - inoutStats.blockCount++; - inoutStats.allocationCount += (uint32_t)m_AllocationCount; - inoutStats.blockBytes += GetSize(); - inoutStats.allocationBytes += GetSize() - m_SumFreeSize; -} - -#if VMA_STATS_STRING_ENABLED -void VmaBlockMetadata_Buddy::PrintDetailedMap(class VmaJsonWriter& json, uint32_t mapRefCount) const -{ - VmaDetailedStatistics stats; - VmaClearDetailedStatistics(stats); - AddDetailedStatistics(stats); - - PrintDetailedMap_Begin( - json, - stats.statistics.blockBytes - stats.statistics.allocationBytes, - stats.statistics.allocationCount, - stats.unusedRangeCount, - mapRefCount); - - PrintDetailedMapNode(json, m_Root, LevelToNodeSize(0)); - - const VkDeviceSize unusableSize = GetUnusableSize(); - if (unusableSize > 0) - { - PrintDetailedMap_UnusedRange(json, - m_UsableSize, // offset - unusableSize); // size - } - - PrintDetailedMap_End(json); -} -#endif // VMA_STATS_STRING_ENABLED - -bool VmaBlockMetadata_Buddy::CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) -{ - VMA_ASSERT(!upperAddress && "VMA_ALLOCATION_CREATE_UPPER_ADDRESS_BIT can be used only with linear algorithm."); - - allocSize = AlignAllocationSize(allocSize); - - // Simple way to respect bufferImageGranularity. May be optimized some day. - // Whenever it might be an OPTIMAL image... 
- if (allocType == VMA_SUBALLOCATION_TYPE_UNKNOWN || - allocType == VMA_SUBALLOCATION_TYPE_IMAGE_UNKNOWN || - allocType == VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL) - { - allocAlignment = VMA_MAX(allocAlignment, GetBufferImageGranularity()); - allocSize = VmaAlignUp(allocSize, GetBufferImageGranularity()); - } - - if (allocSize > m_UsableSize) - { - return false; - } - - const uint32_t targetLevel = AllocSizeToLevel(allocSize); - for (uint32_t level = targetLevel; level--; ) - { - for (Node* freeNode = m_FreeList[level].front; - freeNode != VMA_NULL; - freeNode = freeNode->free.next) - { - if (freeNode->offset % allocAlignment == 0) - { - pAllocationRequest->type = VmaAllocationRequestType::Normal; - pAllocationRequest->allocHandle = (VmaAllocHandle)(freeNode->offset + 1); - pAllocationRequest->size = allocSize; - pAllocationRequest->customData = (void*)(uintptr_t)level; - return true; - } - } - } - - return false; -} - -void VmaBlockMetadata_Buddy::Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) -{ - VMA_ASSERT(request.type == VmaAllocationRequestType::Normal); - - const uint32_t targetLevel = AllocSizeToLevel(request.size); - uint32_t currLevel = (uint32_t)(uintptr_t)request.customData; - - Node* currNode = m_FreeList[currLevel].front; - VMA_ASSERT(currNode != VMA_NULL && currNode->type == Node::TYPE_FREE); - const VkDeviceSize offset = (VkDeviceSize)request.allocHandle - 1; - while (currNode->offset != offset) - { - currNode = currNode->free.next; - VMA_ASSERT(currNode != VMA_NULL && currNode->type == Node::TYPE_FREE); - } - - // Go down, splitting free nodes. - while (currLevel < targetLevel) - { - // currNode is already first free node at currLevel. - // Remove it from list of free nodes at this currLevel. - RemoveFromFreeList(currLevel, currNode); - - const uint32_t childrenLevel = currLevel + 1; - - // Create two free sub-nodes. - Node* leftChild = m_NodeAllocator.Alloc(); - Node* rightChild = m_NodeAllocator.Alloc(); - - leftChild->offset = currNode->offset; - leftChild->type = Node::TYPE_FREE; - leftChild->parent = currNode; - leftChild->buddy = rightChild; - - rightChild->offset = currNode->offset + LevelToNodeSize(childrenLevel); - rightChild->type = Node::TYPE_FREE; - rightChild->parent = currNode; - rightChild->buddy = leftChild; - - // Convert current currNode to split type. - currNode->type = Node::TYPE_SPLIT; - currNode->split.leftChild = leftChild; - - // Add child nodes to free list. Order is important! - AddToFreeListFront(childrenLevel, rightChild); - AddToFreeListFront(childrenLevel, leftChild); - - ++m_FreeCount; - ++currLevel; - currNode = m_FreeList[currLevel].front; - - /* - We can be sure that currNode, as left child of node previously split, - also fulfills the alignment requirement. - */ - } - - // Remove from free list. - VMA_ASSERT(currLevel == targetLevel && - currNode != VMA_NULL && - currNode->type == Node::TYPE_FREE); - RemoveFromFreeList(currLevel, currNode); - - // Convert to allocation node. 
- currNode->type = Node::TYPE_ALLOCATION; - currNode->allocation.userData = userData; - - ++m_AllocationCount; - --m_FreeCount; - m_SumFreeSize -= request.size; -} - -void VmaBlockMetadata_Buddy::GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) -{ - uint32_t level = 0; - outInfo.offset = (VkDeviceSize)allocHandle - 1; - const Node* const node = FindAllocationNode(outInfo.offset, level); - outInfo.size = LevelToNodeSize(level); - outInfo.pUserData = node->allocation.userData; -} - -void* VmaBlockMetadata_Buddy::GetAllocationUserData(VmaAllocHandle allocHandle) const -{ - uint32_t level = 0; - const Node* const node = FindAllocationNode((VkDeviceSize)allocHandle - 1, level); - return node->allocation.userData; -} - -VmaAllocHandle VmaBlockMetadata_Buddy::GetAllocationListBegin() const -{ - // Function only used for defragmentation, which is disabled for this algorithm - return VK_NULL_HANDLE; -} - -VmaAllocHandle VmaBlockMetadata_Buddy::GetNextAllocation(VmaAllocHandle prevAlloc) const -{ - // Function only used for defragmentation, which is disabled for this algorithm - return VK_NULL_HANDLE; -} - -void VmaBlockMetadata_Buddy::DeleteNodeChildren(Node* node) -{ - if (node->type == Node::TYPE_SPLIT) - { - DeleteNodeChildren(node->split.leftChild->buddy); - DeleteNodeChildren(node->split.leftChild); - const VkAllocationCallbacks* allocationCallbacks = GetAllocationCallbacks(); - m_NodeAllocator.Free(node->split.leftChild->buddy); - m_NodeAllocator.Free(node->split.leftChild); - } -} - -void VmaBlockMetadata_Buddy::Clear() -{ - DeleteNodeChildren(m_Root); - m_Root->type = Node::TYPE_FREE; - m_AllocationCount = 0; - m_FreeCount = 1; - m_SumFreeSize = m_UsableSize; -} - -void VmaBlockMetadata_Buddy::SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) -{ - uint32_t level = 0; - Node* const node = FindAllocationNode((VkDeviceSize)allocHandle - 1, level); - node->allocation.userData = userData; -} - -VmaBlockMetadata_Buddy::Node* VmaBlockMetadata_Buddy::FindAllocationNode(VkDeviceSize offset, uint32_t& outLevel) const -{ - Node* node = m_Root; - VkDeviceSize nodeOffset = 0; - outLevel = 0; - VkDeviceSize levelNodeSize = LevelToNodeSize(0); - while (node->type == Node::TYPE_SPLIT) - { - const VkDeviceSize nextLevelNodeSize = levelNodeSize >> 1; - if (offset < nodeOffset + nextLevelNodeSize) - { - node = node->split.leftChild; - } - else - { - node = node->split.leftChild->buddy; - nodeOffset += nextLevelNodeSize; - } - ++outLevel; - levelNodeSize = nextLevelNodeSize; - } - - VMA_ASSERT(node != VMA_NULL && node->type == Node::TYPE_ALLOCATION); - return node; -} - -bool VmaBlockMetadata_Buddy::ValidateNode(ValidationContext& ctx, const Node* parent, const Node* curr, uint32_t level, VkDeviceSize levelNodeSize) const -{ - VMA_VALIDATE(level < m_LevelCount); - VMA_VALIDATE(curr->parent == parent); - VMA_VALIDATE((curr->buddy == VMA_NULL) == (parent == VMA_NULL)); - VMA_VALIDATE(curr->buddy == VMA_NULL || curr->buddy->buddy == curr); - switch (curr->type) - { - case Node::TYPE_FREE: - // curr->free.prev, next are validated separately. 
- ctx.calculatedSumFreeSize += levelNodeSize; - ++ctx.calculatedFreeCount; - break; - case Node::TYPE_ALLOCATION: - ++ctx.calculatedAllocationCount; - if (!IsVirtual()) - { - VMA_VALIDATE(curr->allocation.userData != VMA_NULL); - } - break; - case Node::TYPE_SPLIT: - { - const uint32_t childrenLevel = level + 1; - const VkDeviceSize childrenLevelNodeSize = levelNodeSize >> 1; - const Node* const leftChild = curr->split.leftChild; - VMA_VALIDATE(leftChild != VMA_NULL); - VMA_VALIDATE(leftChild->offset == curr->offset); - if (!ValidateNode(ctx, curr, leftChild, childrenLevel, childrenLevelNodeSize)) - { - VMA_VALIDATE(false && "ValidateNode for left child failed."); - } - const Node* const rightChild = leftChild->buddy; - VMA_VALIDATE(rightChild->offset == curr->offset + childrenLevelNodeSize); - if (!ValidateNode(ctx, curr, rightChild, childrenLevel, childrenLevelNodeSize)) - { - VMA_VALIDATE(false && "ValidateNode for right child failed."); - } - } - break; - default: - return false; - } - - return true; -} - -uint32_t VmaBlockMetadata_Buddy::AllocSizeToLevel(VkDeviceSize allocSize) const -{ - // I know this could be optimized somehow e.g. by using std::log2p1 from C++20. - uint32_t level = 0; - VkDeviceSize currLevelNodeSize = m_UsableSize; - VkDeviceSize nextLevelNodeSize = currLevelNodeSize >> 1; - while (allocSize <= nextLevelNodeSize && level + 1 < m_LevelCount) - { - ++level; - currLevelNodeSize >>= 1; - nextLevelNodeSize >>= 1; - } - return level; -} - -void VmaBlockMetadata_Buddy::Free(VmaAllocHandle allocHandle) -{ - uint32_t level = 0; - Node* node = FindAllocationNode((VkDeviceSize)allocHandle - 1, level); - - ++m_FreeCount; - --m_AllocationCount; - m_SumFreeSize += LevelToNodeSize(level); - - node->type = Node::TYPE_FREE; - - // Join free nodes if possible. - while (level > 0 && node->buddy->type == Node::TYPE_FREE) - { - RemoveFromFreeList(level, node->buddy); - Node* const parent = node->parent; - - m_NodeAllocator.Free(node->buddy); - m_NodeAllocator.Free(node); - parent->type = Node::TYPE_FREE; - - node = parent; - --level; - --m_FreeCount; - } - - AddToFreeListFront(level, node); -} - -void VmaBlockMetadata_Buddy::AddNodeToDetailedStatistics(VmaDetailedStatistics& inoutStats, const Node* node, VkDeviceSize levelNodeSize) const -{ - switch (node->type) - { - case Node::TYPE_FREE: - VmaAddDetailedStatisticsUnusedRange(inoutStats, levelNodeSize); - break; - case Node::TYPE_ALLOCATION: - VmaAddDetailedStatisticsAllocation(inoutStats, levelNodeSize); - break; - case Node::TYPE_SPLIT: - { - const VkDeviceSize childrenNodeSize = levelNodeSize / 2; - const Node* const leftChild = node->split.leftChild; - AddNodeToDetailedStatistics(inoutStats, leftChild, childrenNodeSize); - const Node* const rightChild = leftChild->buddy; - AddNodeToDetailedStatistics(inoutStats, rightChild, childrenNodeSize); - } - break; - default: - VMA_ASSERT(0); - } -} - -void VmaBlockMetadata_Buddy::AddToFreeListFront(uint32_t level, Node* node) -{ - VMA_ASSERT(node->type == Node::TYPE_FREE); - - // List is empty. 
- Node* const frontNode = m_FreeList[level].front; - if (frontNode == VMA_NULL) - { - VMA_ASSERT(m_FreeList[level].back == VMA_NULL); - node->free.prev = node->free.next = VMA_NULL; - m_FreeList[level].front = m_FreeList[level].back = node; - } - else - { - VMA_ASSERT(frontNode->free.prev == VMA_NULL); - node->free.prev = VMA_NULL; - node->free.next = frontNode; - frontNode->free.prev = node; - m_FreeList[level].front = node; - } -} - -void VmaBlockMetadata_Buddy::RemoveFromFreeList(uint32_t level, Node* node) -{ - VMA_ASSERT(m_FreeList[level].front != VMA_NULL); - - // It is at the front. - if (node->free.prev == VMA_NULL) - { - VMA_ASSERT(m_FreeList[level].front == node); - m_FreeList[level].front = node->free.next; - } - else - { - Node* const prevFreeNode = node->free.prev; - VMA_ASSERT(prevFreeNode->free.next == node); - prevFreeNode->free.next = node->free.next; - } - - // It is at the back. - if (node->free.next == VMA_NULL) - { - VMA_ASSERT(m_FreeList[level].back == node); - m_FreeList[level].back = node->free.prev; - } - else - { - Node* const nextFreeNode = node->free.next; - VMA_ASSERT(nextFreeNode->free.prev == node); - nextFreeNode->free.prev = node->free.prev; - } -} - -void VmaBlockMetadata_Buddy::DebugLogAllAllocationNode(Node* node, uint32_t level) const -{ - switch (node->type) - { - case Node::TYPE_FREE: - break; - case Node::TYPE_ALLOCATION: - DebugLogAllocation(node->offset, LevelToNodeSize(level), node->allocation.userData); - break; - case Node::TYPE_SPLIT: - { - ++level; - DebugLogAllAllocationNode(node->split.leftChild, level); - DebugLogAllAllocationNode(node->split.leftChild->buddy, level); - } - break; - default: - VMA_ASSERT(0); - } -} - -#if VMA_STATS_STRING_ENABLED -void VmaBlockMetadata_Buddy::PrintDetailedMapNode(class VmaJsonWriter& json, const Node* node, VkDeviceSize levelNodeSize) const -{ - switch (node->type) - { - case Node::TYPE_FREE: - PrintDetailedMap_UnusedRange(json, node->offset, levelNodeSize); - break; - case Node::TYPE_ALLOCATION: - PrintDetailedMap_Allocation(json, node->offset, levelNodeSize, node->allocation.userData); - break; - case Node::TYPE_SPLIT: - { - const VkDeviceSize childrenNodeSize = levelNodeSize / 2; - const Node* const leftChild = node->split.leftChild; - PrintDetailedMapNode(json, leftChild, childrenNodeSize); - const Node* const rightChild = leftChild->buddy; - PrintDetailedMapNode(json, rightChild, childrenNodeSize); - } - break; - default: - VMA_ASSERT(0); - } -} -#endif // VMA_STATS_STRING_ENABLED -#endif // _VMA_BLOCK_METADATA_BUDDY_FUNCTIONS -#endif // _VMA_BLOCK_METADATA_BUDDY -#endif // #if 0 - -#ifndef _VMA_BLOCK_METADATA_TLSF -// To not search current larger region if first allocation won't succeed and skip to smaller range -// use with VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT as strategy in CreateAllocationRequest(). -// When fragmentation and reusal of previous blocks doesn't matter then use with -// VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT for fastest alloc time possible. 
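For context, the strategy hints named in the deleted comment above correspond to the public `VMA_ALLOCATION_CREATE_STRATEGY_*` flags that eventually reach `CreateAllocationRequest()`. A minimal sketch of passing them through the public VMA 3.x API is shown below; it is an illustration only, not part of this patch — the buffer size and usage are arbitrary placeholders, and the `VmaAllocator` handle is assumed to have been created elsewhere.

```
#include "vk_mem_alloc.h"

// Sketch: ask for the fastest allocation path (MIN_TIME) for a short-lived
// buffer; swapping in VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT would
// instead prefer tighter packing at the cost of a longer free-list search.
// 'allocator' is assumed to be a valid VmaAllocator; size/usage are placeholders.
VkResult CreateScratchBuffer(VmaAllocator allocator, VkBuffer* outBuf, VmaAllocation* outAlloc)
{
    VkBufferCreateInfo bufInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
    bufInfo.size = 64 * 1024;                        // placeholder size
    bufInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT; // placeholder usage

    VmaAllocationCreateInfo allocInfo = {};
    allocInfo.usage = VMA_MEMORY_USAGE_AUTO;
    allocInfo.flags = VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT;

    return vmaCreateBuffer(allocator, &bufInfo, &allocInfo, outBuf, outAlloc, VMA_NULL);
}
```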
-class VmaBlockMetadata_TLSF : public VmaBlockMetadata -{ - VMA_CLASS_NO_COPY(VmaBlockMetadata_TLSF) -public: - VmaBlockMetadata_TLSF(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual); - virtual ~VmaBlockMetadata_TLSF(); - - size_t GetAllocationCount() const override { return m_AllocCount; } - size_t GetFreeRegionsCount() const override { return m_BlocksFreeCount + 1; } - VkDeviceSize GetSumFreeSize() const override { return m_BlocksFreeSize + m_NullBlock->size; } - bool IsEmpty() const override { return m_NullBlock->offset == 0; } - VkDeviceSize GetAllocationOffset(VmaAllocHandle allocHandle) const override { return ((Block*)allocHandle)->offset; }; - - void Init(VkDeviceSize size) override; - bool Validate() const override; - - void AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const override; - void AddStatistics(VmaStatistics& inoutStats) const override; - -#if VMA_STATS_STRING_ENABLED - void PrintDetailedMap(class VmaJsonWriter& json) const override; -#endif - - bool CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) override; - - VkResult CheckCorruption(const void* pBlockData) override; - void Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) override; - - void Free(VmaAllocHandle allocHandle) override; - void GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) override; - void* GetAllocationUserData(VmaAllocHandle allocHandle) const override; - VmaAllocHandle GetAllocationListBegin() const override; - VmaAllocHandle GetNextAllocation(VmaAllocHandle prevAlloc) const override; - VkDeviceSize GetNextFreeRegionSize(VmaAllocHandle alloc) const override; - void Clear() override; - void SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) override; - void DebugLogAllAllocations() const override; - -private: - // According to original paper it should be preferable 4 or 5: - // M. Masmano, I. Ripoll, A. Crespo, and J. 
Real "TLSF: a New Dynamic Memory Allocator for Real-Time Systems" - // http://www.gii.upv.es/tlsf/files/ecrts04_tlsf.pdf - static const uint8_t SECOND_LEVEL_INDEX = 5; - static const uint16_t SMALL_BUFFER_SIZE = 256; - static const uint32_t INITIAL_BLOCK_ALLOC_COUNT = 16; - static const uint8_t MEMORY_CLASS_SHIFT = 7; - static const uint8_t MAX_MEMORY_CLASSES = 65 - MEMORY_CLASS_SHIFT; - - class Block - { - public: - VkDeviceSize offset; - VkDeviceSize size; - Block* prevPhysical; - Block* nextPhysical; - - void MarkFree() { prevFree = VMA_NULL; } - void MarkTaken() { prevFree = this; } - bool IsFree() const { return prevFree != this; } - void*& UserData() { VMA_HEAVY_ASSERT(!IsFree()); return userData; } - Block*& PrevFree() { return prevFree; } - Block*& NextFree() { VMA_HEAVY_ASSERT(IsFree()); return nextFree; } - - private: - Block* prevFree; // Address of the same block here indicates that block is taken - union - { - Block* nextFree; - void* userData; - }; - }; - - size_t m_AllocCount; - // Total number of free blocks besides null block - size_t m_BlocksFreeCount; - // Total size of free blocks excluding null block - VkDeviceSize m_BlocksFreeSize; - uint32_t m_IsFreeBitmap; - uint8_t m_MemoryClasses; - uint32_t m_InnerIsFreeBitmap[MAX_MEMORY_CLASSES]; - uint32_t m_ListsCount; - /* - * 0: 0-3 lists for small buffers - * 1+: 0-(2^SLI-1) lists for normal buffers - */ - Block** m_FreeList; - VmaPoolAllocator m_BlockAllocator; - Block* m_NullBlock; - VmaBlockBufferImageGranularity m_GranularityHandler; - - uint8_t SizeToMemoryClass(VkDeviceSize size) const; - uint16_t SizeToSecondIndex(VkDeviceSize size, uint8_t memoryClass) const; - uint32_t GetListIndex(uint8_t memoryClass, uint16_t secondIndex) const; - uint32_t GetListIndex(VkDeviceSize size) const; - - void RemoveFreeBlock(Block* block); - void InsertFreeBlock(Block* block); - void MergeBlock(Block* block, Block* prev); - - Block* FindFreeBlock(VkDeviceSize size, uint32_t& listIndex) const; - bool CheckBlock( - Block& block, - uint32_t listIndex, - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - VmaAllocationRequest* pAllocationRequest); -}; - -#ifndef _VMA_BLOCK_METADATA_TLSF_FUNCTIONS -VmaBlockMetadata_TLSF::VmaBlockMetadata_TLSF(const VkAllocationCallbacks* pAllocationCallbacks, - VkDeviceSize bufferImageGranularity, bool isVirtual) - : VmaBlockMetadata(pAllocationCallbacks, bufferImageGranularity, isVirtual), - m_AllocCount(0), - m_BlocksFreeCount(0), - m_BlocksFreeSize(0), - m_IsFreeBitmap(0), - m_MemoryClasses(0), - m_ListsCount(0), - m_FreeList(VMA_NULL), - m_BlockAllocator(pAllocationCallbacks, INITIAL_BLOCK_ALLOC_COUNT), - m_NullBlock(VMA_NULL), - m_GranularityHandler(bufferImageGranularity) {} - -VmaBlockMetadata_TLSF::~VmaBlockMetadata_TLSF() -{ - if (m_FreeList) - vma_delete_array(GetAllocationCallbacks(), m_FreeList, m_ListsCount); - m_GranularityHandler.Destroy(GetAllocationCallbacks()); -} - -void VmaBlockMetadata_TLSF::Init(VkDeviceSize size) -{ - VmaBlockMetadata::Init(size); - - if (!IsVirtual()) - m_GranularityHandler.Init(GetAllocationCallbacks(), size); - - m_NullBlock = m_BlockAllocator.Alloc(); - m_NullBlock->size = size; - m_NullBlock->offset = 0; - m_NullBlock->prevPhysical = VMA_NULL; - m_NullBlock->nextPhysical = VMA_NULL; - m_NullBlock->MarkFree(); - m_NullBlock->NextFree() = VMA_NULL; - m_NullBlock->PrevFree() = VMA_NULL; - uint8_t memoryClass = SizeToMemoryClass(size); - uint16_t sli = SizeToSecondIndex(size, memoryClass); - m_ListsCount = (memoryClass == 0 ? 
0 : (memoryClass - 1) * (1UL << SECOND_LEVEL_INDEX) + sli) + 1; - if (IsVirtual()) - m_ListsCount += 1UL << SECOND_LEVEL_INDEX; - else - m_ListsCount += 4; - - m_MemoryClasses = memoryClass + 2; - memset(m_InnerIsFreeBitmap, 0, MAX_MEMORY_CLASSES * sizeof(uint32_t)); - - m_FreeList = vma_new_array(GetAllocationCallbacks(), Block*, m_ListsCount); - memset(m_FreeList, 0, m_ListsCount * sizeof(Block*)); -} - -bool VmaBlockMetadata_TLSF::Validate() const -{ - VMA_VALIDATE(GetSumFreeSize() <= GetSize()); - - VkDeviceSize calculatedSize = m_NullBlock->size; - VkDeviceSize calculatedFreeSize = m_NullBlock->size; - size_t allocCount = 0; - size_t freeCount = 0; - - // Check integrity of free lists - for (uint32_t list = 0; list < m_ListsCount; ++list) - { - Block* block = m_FreeList[list]; - if (block != VMA_NULL) - { - VMA_VALIDATE(block->IsFree()); - VMA_VALIDATE(block->PrevFree() == VMA_NULL); - while (block->NextFree()) - { - VMA_VALIDATE(block->NextFree()->IsFree()); - VMA_VALIDATE(block->NextFree()->PrevFree() == block); - block = block->NextFree(); - } - } - } - - VkDeviceSize nextOffset = m_NullBlock->offset; - auto validateCtx = m_GranularityHandler.StartValidation(GetAllocationCallbacks(), IsVirtual()); - - VMA_VALIDATE(m_NullBlock->nextPhysical == VMA_NULL); - if (m_NullBlock->prevPhysical) - { - VMA_VALIDATE(m_NullBlock->prevPhysical->nextPhysical == m_NullBlock); - } - // Check all blocks - for (Block* prev = m_NullBlock->prevPhysical; prev != VMA_NULL; prev = prev->prevPhysical) - { - VMA_VALIDATE(prev->offset + prev->size == nextOffset); - nextOffset = prev->offset; - calculatedSize += prev->size; - - uint32_t listIndex = GetListIndex(prev->size); - if (prev->IsFree()) - { - ++freeCount; - // Check if free block belongs to free list - Block* freeBlock = m_FreeList[listIndex]; - VMA_VALIDATE(freeBlock != VMA_NULL); - - bool found = false; - do - { - if (freeBlock == prev) - found = true; - - freeBlock = freeBlock->NextFree(); - } while (!found && freeBlock != VMA_NULL); - - VMA_VALIDATE(found); - calculatedFreeSize += prev->size; - } - else - { - ++allocCount; - // Check if taken block is not on a free list - Block* freeBlock = m_FreeList[listIndex]; - while (freeBlock) - { - VMA_VALIDATE(freeBlock != prev); - freeBlock = freeBlock->NextFree(); - } - - if (!IsVirtual()) - { - VMA_VALIDATE(m_GranularityHandler.Validate(validateCtx, prev->offset, prev->size)); - } - } - - if (prev->prevPhysical) - { - VMA_VALIDATE(prev->prevPhysical->nextPhysical == prev); - } - } - - if (!IsVirtual()) - { - VMA_VALIDATE(m_GranularityHandler.FinishValidation(validateCtx)); - } - - VMA_VALIDATE(nextOffset == 0); - VMA_VALIDATE(calculatedSize == GetSize()); - VMA_VALIDATE(calculatedFreeSize == GetSumFreeSize()); - VMA_VALIDATE(allocCount == m_AllocCount); - VMA_VALIDATE(freeCount == m_BlocksFreeCount); - - return true; -} - -void VmaBlockMetadata_TLSF::AddDetailedStatistics(VmaDetailedStatistics& inoutStats) const -{ - inoutStats.statistics.blockCount++; - inoutStats.statistics.blockBytes += GetSize(); - if (m_NullBlock->size > 0) - VmaAddDetailedStatisticsUnusedRange(inoutStats, m_NullBlock->size); - - for (Block* block = m_NullBlock->prevPhysical; block != VMA_NULL; block = block->prevPhysical) - { - if (block->IsFree()) - VmaAddDetailedStatisticsUnusedRange(inoutStats, block->size); - else - VmaAddDetailedStatisticsAllocation(inoutStats, block->size); - } -} - -void VmaBlockMetadata_TLSF::AddStatistics(VmaStatistics& inoutStats) const -{ - inoutStats.blockCount++; - inoutStats.allocationCount += 
(uint32_t)m_AllocCount; - inoutStats.blockBytes += GetSize(); - inoutStats.allocationBytes += GetSize() - GetSumFreeSize(); -} - -#if VMA_STATS_STRING_ENABLED -void VmaBlockMetadata_TLSF::PrintDetailedMap(class VmaJsonWriter& json) const -{ - size_t blockCount = m_AllocCount + m_BlocksFreeCount; - VmaStlAllocator allocator(GetAllocationCallbacks()); - VmaVector> blockList(blockCount, allocator); - - size_t i = blockCount; - for (Block* block = m_NullBlock->prevPhysical; block != VMA_NULL; block = block->prevPhysical) - { - blockList[--i] = block; - } - VMA_ASSERT(i == 0); - - VmaDetailedStatistics stats; - VmaClearDetailedStatistics(stats); - AddDetailedStatistics(stats); - - PrintDetailedMap_Begin(json, - stats.statistics.blockBytes - stats.statistics.allocationBytes, - stats.statistics.allocationCount, - stats.unusedRangeCount); - - for (; i < blockCount; ++i) - { - Block* block = blockList[i]; - if (block->IsFree()) - PrintDetailedMap_UnusedRange(json, block->offset, block->size); - else - PrintDetailedMap_Allocation(json, block->offset, block->size, block->UserData()); - } - if (m_NullBlock->size > 0) - PrintDetailedMap_UnusedRange(json, m_NullBlock->offset, m_NullBlock->size); - - PrintDetailedMap_End(json); -} -#endif - -bool VmaBlockMetadata_TLSF::CreateAllocationRequest( - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - bool upperAddress, - VmaSuballocationType allocType, - uint32_t strategy, - VmaAllocationRequest* pAllocationRequest) -{ - VMA_ASSERT(allocSize > 0 && "Cannot allocate empty block!"); - VMA_ASSERT(!upperAddress && "VMA_ALLOCATION_CREATE_UPPER_ADDRESS_BIT can be used only with linear algorithm."); - - // For small granularity round up - if (!IsVirtual()) - m_GranularityHandler.RoundupAllocRequest(allocType, allocSize, allocAlignment); - - allocSize += GetDebugMargin(); - // Quick check for too small pool - if (allocSize > GetSumFreeSize()) - return false; - - // If no free blocks in pool then check only null block - if (m_BlocksFreeCount == 0) - return CheckBlock(*m_NullBlock, m_ListsCount, allocSize, allocAlignment, allocType, pAllocationRequest); - - // Round up to the next block - VkDeviceSize sizeForNextList = allocSize; - VkDeviceSize smallSizeStep = SMALL_BUFFER_SIZE / (IsVirtual() ? 
1 << SECOND_LEVEL_INDEX : 4); - if (allocSize > SMALL_BUFFER_SIZE) - { - sizeForNextList += (1ULL << (VMA_BITSCAN_MSB(allocSize) - SECOND_LEVEL_INDEX)); - } - else if (allocSize > SMALL_BUFFER_SIZE - smallSizeStep) - sizeForNextList = SMALL_BUFFER_SIZE + 1; - else - sizeForNextList += smallSizeStep; - - uint32_t nextListIndex = 0; - uint32_t prevListIndex = 0; - Block* nextListBlock = VMA_NULL; - Block* prevListBlock = VMA_NULL; - - // Check blocks according to strategies - if (strategy & VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT) - { - // Quick check for larger block first - nextListBlock = FindFreeBlock(sizeForNextList, nextListIndex); - if (nextListBlock != VMA_NULL && CheckBlock(*nextListBlock, nextListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - - // If not fitted then null block - if (CheckBlock(*m_NullBlock, m_ListsCount, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - - // Null block failed, search larger bucket - while (nextListBlock) - { - if (CheckBlock(*nextListBlock, nextListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - nextListBlock = nextListBlock->NextFree(); - } - - // Failed again, check best fit bucket - prevListBlock = FindFreeBlock(allocSize, prevListIndex); - while (prevListBlock) - { - if (CheckBlock(*prevListBlock, prevListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - prevListBlock = prevListBlock->NextFree(); - } - } - else if (strategy & VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT) - { - // Check best fit bucket - prevListBlock = FindFreeBlock(allocSize, prevListIndex); - while (prevListBlock) - { - if (CheckBlock(*prevListBlock, prevListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - prevListBlock = prevListBlock->NextFree(); - } - - // If failed check null block - if (CheckBlock(*m_NullBlock, m_ListsCount, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - - // Check larger bucket - nextListBlock = FindFreeBlock(sizeForNextList, nextListIndex); - while (nextListBlock) - { - if (CheckBlock(*nextListBlock, nextListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - nextListBlock = nextListBlock->NextFree(); - } - } - else if (strategy & VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT ) - { - // Perform search from the start - VmaStlAllocator allocator(GetAllocationCallbacks()); - VmaVector> blockList(m_BlocksFreeCount, allocator); - - size_t i = m_BlocksFreeCount; - for (Block* block = m_NullBlock->prevPhysical; block != VMA_NULL; block = block->prevPhysical) - { - if (block->IsFree() && block->size >= allocSize) - blockList[--i] = block; - } - - for (; i < m_BlocksFreeCount; ++i) - { - Block& block = *blockList[i]; - if (CheckBlock(block, GetListIndex(block.size), allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - } - - // If failed check null block - if (CheckBlock(*m_NullBlock, m_ListsCount, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - - // Whole range searched, no more memory - return false; - } - else - { - // Check larger bucket - nextListBlock = FindFreeBlock(sizeForNextList, nextListIndex); - while (nextListBlock) - { - if (CheckBlock(*nextListBlock, nextListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - nextListBlock = nextListBlock->NextFree(); - } - - // If failed check null block - if (CheckBlock(*m_NullBlock, m_ListsCount, 
allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - - // Check best fit bucket - prevListBlock = FindFreeBlock(allocSize, prevListIndex); - while (prevListBlock) - { - if (CheckBlock(*prevListBlock, prevListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - prevListBlock = prevListBlock->NextFree(); - } - } - - // Worst case, full search has to be done - while (++nextListIndex < m_ListsCount) - { - nextListBlock = m_FreeList[nextListIndex]; - while (nextListBlock) - { - if (CheckBlock(*nextListBlock, nextListIndex, allocSize, allocAlignment, allocType, pAllocationRequest)) - return true; - nextListBlock = nextListBlock->NextFree(); - } - } - - // No more memory sadly - return false; -} - -VkResult VmaBlockMetadata_TLSF::CheckCorruption(const void* pBlockData) -{ - for (Block* block = m_NullBlock->prevPhysical; block != VMA_NULL; block = block->prevPhysical) - { - if (!block->IsFree()) - { - if (!VmaValidateMagicValue(pBlockData, block->offset + block->size)) - { - VMA_ASSERT(0 && "MEMORY CORRUPTION DETECTED AFTER VALIDATED ALLOCATION!"); - return VK_ERROR_UNKNOWN_COPY; - } - } - } - - return VK_SUCCESS; -} - -void VmaBlockMetadata_TLSF::Alloc( - const VmaAllocationRequest& request, - VmaSuballocationType type, - void* userData) -{ - VMA_ASSERT(request.type == VmaAllocationRequestType::TLSF); - - // Get block and pop it from the free list - Block* currentBlock = (Block*)request.allocHandle; - VkDeviceSize offset = request.algorithmData; - VMA_ASSERT(currentBlock != VMA_NULL); - VMA_ASSERT(currentBlock->offset <= offset); - - if (currentBlock != m_NullBlock) - RemoveFreeBlock(currentBlock); - - VkDeviceSize debugMargin = GetDebugMargin(); - VkDeviceSize misssingAlignment = offset - currentBlock->offset; - - // Append missing alignment to prev block or create new one - if (misssingAlignment) - { - Block* prevBlock = currentBlock->prevPhysical; - VMA_ASSERT(prevBlock != VMA_NULL && "There should be no missing alignment at offset 0!"); - - if (prevBlock->IsFree() && prevBlock->size != debugMargin) - { - uint32_t oldList = GetListIndex(prevBlock->size); - prevBlock->size += misssingAlignment; - // Check if new size crosses list bucket - if (oldList != GetListIndex(prevBlock->size)) - { - prevBlock->size -= misssingAlignment; - RemoveFreeBlock(prevBlock); - prevBlock->size += misssingAlignment; - InsertFreeBlock(prevBlock); - } - else - m_BlocksFreeSize += misssingAlignment; - } - else - { - Block* newBlock = m_BlockAllocator.Alloc(); - currentBlock->prevPhysical = newBlock; - prevBlock->nextPhysical = newBlock; - newBlock->prevPhysical = prevBlock; - newBlock->nextPhysical = currentBlock; - newBlock->size = misssingAlignment; - newBlock->offset = currentBlock->offset; - newBlock->MarkTaken(); - - InsertFreeBlock(newBlock); - } - - currentBlock->size -= misssingAlignment; - currentBlock->offset += misssingAlignment; - } - - VkDeviceSize size = request.size + debugMargin; - if (currentBlock->size == size) - { - if (currentBlock == m_NullBlock) - { - // Setup new null block - m_NullBlock = m_BlockAllocator.Alloc(); - m_NullBlock->size = 0; - m_NullBlock->offset = currentBlock->offset + size; - m_NullBlock->prevPhysical = currentBlock; - m_NullBlock->nextPhysical = VMA_NULL; - m_NullBlock->MarkFree(); - m_NullBlock->PrevFree() = VMA_NULL; - m_NullBlock->NextFree() = VMA_NULL; - currentBlock->nextPhysical = m_NullBlock; - currentBlock->MarkTaken(); - } - } - else - { - VMA_ASSERT(currentBlock->size > size && "Proper block already found, 
shouldn't find smaller one!"); - - // Create new free block - Block* newBlock = m_BlockAllocator.Alloc(); - newBlock->size = currentBlock->size - size; - newBlock->offset = currentBlock->offset + size; - newBlock->prevPhysical = currentBlock; - newBlock->nextPhysical = currentBlock->nextPhysical; - currentBlock->nextPhysical = newBlock; - currentBlock->size = size; - - if (currentBlock == m_NullBlock) - { - m_NullBlock = newBlock; - m_NullBlock->MarkFree(); - m_NullBlock->NextFree() = VMA_NULL; - m_NullBlock->PrevFree() = VMA_NULL; - currentBlock->MarkTaken(); - } - else - { - newBlock->nextPhysical->prevPhysical = newBlock; - newBlock->MarkTaken(); - InsertFreeBlock(newBlock); - } - } - currentBlock->UserData() = userData; - - if (debugMargin > 0) - { - currentBlock->size -= debugMargin; - Block* newBlock = m_BlockAllocator.Alloc(); - newBlock->size = debugMargin; - newBlock->offset = currentBlock->offset + currentBlock->size; - newBlock->prevPhysical = currentBlock; - newBlock->nextPhysical = currentBlock->nextPhysical; - newBlock->MarkTaken(); - currentBlock->nextPhysical->prevPhysical = newBlock; - currentBlock->nextPhysical = newBlock; - InsertFreeBlock(newBlock); - } - - if (!IsVirtual()) - m_GranularityHandler.AllocPages((uint8_t)(uintptr_t)request.customData, - currentBlock->offset, currentBlock->size); - ++m_AllocCount; -} - -void VmaBlockMetadata_TLSF::Free(VmaAllocHandle allocHandle) -{ - Block* block = (Block*)allocHandle; - Block* next = block->nextPhysical; - VMA_ASSERT(!block->IsFree() && "Block is already free!"); - - if (!IsVirtual()) - m_GranularityHandler.FreePages(block->offset, block->size); - --m_AllocCount; - - VkDeviceSize debugMargin = GetDebugMargin(); - if (debugMargin > 0) - { - RemoveFreeBlock(next); - MergeBlock(next, block); - block = next; - next = next->nextPhysical; - } - - // Try merging - Block* prev = block->prevPhysical; - if (prev != VMA_NULL && prev->IsFree() && prev->size != debugMargin) - { - RemoveFreeBlock(prev); - MergeBlock(block, prev); - } - - if (!next->IsFree()) - InsertFreeBlock(block); - else if (next == m_NullBlock) - MergeBlock(m_NullBlock, block); - else - { - RemoveFreeBlock(next); - MergeBlock(next, block); - InsertFreeBlock(next); - } -} - -void VmaBlockMetadata_TLSF::GetAllocationInfo(VmaAllocHandle allocHandle, VmaVirtualAllocationInfo& outInfo) -{ - Block* block = (Block*)allocHandle; - VMA_ASSERT(!block->IsFree() && "Cannot get allocation info for free block!"); - outInfo.offset = block->offset; - outInfo.size = block->size; - outInfo.pUserData = block->UserData(); -} - -void* VmaBlockMetadata_TLSF::GetAllocationUserData(VmaAllocHandle allocHandle) const -{ - Block* block = (Block*)allocHandle; - VMA_ASSERT(!block->IsFree() && "Cannot get user data for free block!"); - return block->UserData(); -} - -VmaAllocHandle VmaBlockMetadata_TLSF::GetAllocationListBegin() const -{ - if (m_AllocCount == 0) - return VK_NULL_HANDLE; - - for (Block* block = m_NullBlock->prevPhysical; block; block = block->prevPhysical) - { - if (!block->IsFree()) - return (VmaAllocHandle)block; - } - VMA_ASSERT(false && "If m_AllocCount > 0 then should find any allocation!"); - return VK_NULL_HANDLE; -} - -VmaAllocHandle VmaBlockMetadata_TLSF::GetNextAllocation(VmaAllocHandle prevAlloc) const -{ - Block* startBlock = (Block*)prevAlloc; - VMA_ASSERT(!startBlock->IsFree() && "Incorrect block!"); - - for (Block* block = startBlock->prevPhysical; block; block = block->prevPhysical) - { - if (!block->IsFree()) - return (VmaAllocHandle)block; - } - return 
VK_NULL_HANDLE; -} - -VkDeviceSize VmaBlockMetadata_TLSF::GetNextFreeRegionSize(VmaAllocHandle alloc) const -{ - Block* block = (Block*)alloc; - VMA_ASSERT(!block->IsFree() && "Incorrect block!"); - - if (block->prevPhysical) - return block->prevPhysical->IsFree() ? block->prevPhysical->size : 0; - return 0; -} - -void VmaBlockMetadata_TLSF::Clear() -{ - m_AllocCount = 0; - m_BlocksFreeCount = 0; - m_BlocksFreeSize = 0; - m_IsFreeBitmap = 0; - m_NullBlock->offset = 0; - m_NullBlock->size = GetSize(); - Block* block = m_NullBlock->prevPhysical; - m_NullBlock->prevPhysical = VMA_NULL; - while (block) - { - Block* prev = block->prevPhysical; - m_BlockAllocator.Free(block); - block = prev; - } - memset(m_FreeList, 0, m_ListsCount * sizeof(Block*)); - memset(m_InnerIsFreeBitmap, 0, m_MemoryClasses * sizeof(uint32_t)); - m_GranularityHandler.Clear(); -} - -void VmaBlockMetadata_TLSF::SetAllocationUserData(VmaAllocHandle allocHandle, void* userData) -{ - Block* block = (Block*)allocHandle; - VMA_ASSERT(!block->IsFree() && "Trying to set user data for not allocated block!"); - block->UserData() = userData; -} - -void VmaBlockMetadata_TLSF::DebugLogAllAllocations() const -{ - for (Block* block = m_NullBlock->prevPhysical; block != VMA_NULL; block = block->prevPhysical) - if (!block->IsFree()) - DebugLogAllocation(block->offset, block->size, block->UserData()); -} - -uint8_t VmaBlockMetadata_TLSF::SizeToMemoryClass(VkDeviceSize size) const -{ - if (size > SMALL_BUFFER_SIZE) - return VMA_BITSCAN_MSB(size) - MEMORY_CLASS_SHIFT; - return 0; -} - -uint16_t VmaBlockMetadata_TLSF::SizeToSecondIndex(VkDeviceSize size, uint8_t memoryClass) const -{ - if (memoryClass == 0) - { - if (IsVirtual()) - return static_cast((size - 1) / 8); - else - return static_cast((size - 1) / 64); - } - return static_cast((size >> (memoryClass + MEMORY_CLASS_SHIFT - SECOND_LEVEL_INDEX)) ^ (1U << SECOND_LEVEL_INDEX)); -} - -uint32_t VmaBlockMetadata_TLSF::GetListIndex(uint8_t memoryClass, uint16_t secondIndex) const -{ - if (memoryClass == 0) - return secondIndex; - - const uint32_t index = static_cast(memoryClass - 1) * (1 << SECOND_LEVEL_INDEX) + secondIndex; - if (IsVirtual()) - return index + (1 << SECOND_LEVEL_INDEX); - else - return index + 4; -} - -uint32_t VmaBlockMetadata_TLSF::GetListIndex(VkDeviceSize size) const -{ - uint8_t memoryClass = SizeToMemoryClass(size); - return GetListIndex(memoryClass, SizeToSecondIndex(size, memoryClass)); -} - -void VmaBlockMetadata_TLSF::RemoveFreeBlock(Block* block) -{ - VMA_ASSERT(block != m_NullBlock); - VMA_ASSERT(block->IsFree()); - - if (block->NextFree() != VMA_NULL) - block->NextFree()->PrevFree() = block->PrevFree(); - if (block->PrevFree() != VMA_NULL) - block->PrevFree()->NextFree() = block->NextFree(); - else - { - uint8_t memClass = SizeToMemoryClass(block->size); - uint16_t secondIndex = SizeToSecondIndex(block->size, memClass); - uint32_t index = GetListIndex(memClass, secondIndex); - VMA_ASSERT(m_FreeList[index] == block); - m_FreeList[index] = block->NextFree(); - if (block->NextFree() == VMA_NULL) - { - m_InnerIsFreeBitmap[memClass] &= ~(1U << secondIndex); - if (m_InnerIsFreeBitmap[memClass] == 0) - m_IsFreeBitmap &= ~(1UL << memClass); - } - } - block->MarkTaken(); - block->UserData() = VMA_NULL; - --m_BlocksFreeCount; - m_BlocksFreeSize -= block->size; -} - -void VmaBlockMetadata_TLSF::InsertFreeBlock(Block* block) -{ - VMA_ASSERT(block != m_NullBlock); - VMA_ASSERT(!block->IsFree() && "Cannot insert block twice!"); - - uint8_t memClass = 
SizeToMemoryClass(block->size); - uint16_t secondIndex = SizeToSecondIndex(block->size, memClass); - uint32_t index = GetListIndex(memClass, secondIndex); - VMA_ASSERT(index < m_ListsCount); - block->PrevFree() = VMA_NULL; - block->NextFree() = m_FreeList[index]; - m_FreeList[index] = block; - if (block->NextFree() != VMA_NULL) - block->NextFree()->PrevFree() = block; - else - { - m_InnerIsFreeBitmap[memClass] |= 1U << secondIndex; - m_IsFreeBitmap |= 1UL << memClass; - } - ++m_BlocksFreeCount; - m_BlocksFreeSize += block->size; -} - -void VmaBlockMetadata_TLSF::MergeBlock(Block* block, Block* prev) -{ - VMA_ASSERT(block->prevPhysical == prev && "Cannot merge seperate physical regions!"); - VMA_ASSERT(!prev->IsFree() && "Cannot merge block that belongs to free list!"); - - block->offset = prev->offset; - block->size += prev->size; - block->prevPhysical = prev->prevPhysical; - if (block->prevPhysical) - block->prevPhysical->nextPhysical = block; - m_BlockAllocator.Free(prev); -} - -VmaBlockMetadata_TLSF::Block* VmaBlockMetadata_TLSF::FindFreeBlock(VkDeviceSize size, uint32_t& listIndex) const -{ - uint8_t memoryClass = SizeToMemoryClass(size); - uint32_t innerFreeMap = m_InnerIsFreeBitmap[memoryClass] & (~0U << SizeToSecondIndex(size, memoryClass)); - if (!innerFreeMap) - { - // Check higher levels for avaiable blocks - uint32_t freeMap = m_IsFreeBitmap & (~0UL << (memoryClass + 1)); - if (!freeMap) - return VMA_NULL; // No more memory avaible - - // Find lowest free region - memoryClass = VMA_BITSCAN_LSB(freeMap); - innerFreeMap = m_InnerIsFreeBitmap[memoryClass]; - VMA_ASSERT(innerFreeMap != 0); - } - // Find lowest free subregion - listIndex = GetListIndex(memoryClass, VMA_BITSCAN_LSB(innerFreeMap)); - VMA_ASSERT(m_FreeList[listIndex]); - return m_FreeList[listIndex]; -} - -bool VmaBlockMetadata_TLSF::CheckBlock( - Block& block, - uint32_t listIndex, - VkDeviceSize allocSize, - VkDeviceSize allocAlignment, - VmaSuballocationType allocType, - VmaAllocationRequest* pAllocationRequest) -{ - VMA_ASSERT(block.IsFree() && "Block is already taken!"); - - VkDeviceSize alignedOffset = VmaAlignUp(block.offset, allocAlignment); - if (block.size < allocSize + alignedOffset - block.offset) - return false; - - // Check for granularity conflicts - if (!IsVirtual() && - m_GranularityHandler.CheckConflictAndAlignUp(alignedOffset, allocSize, block.offset, block.size, allocType)) - return false; - - // Alloc successful - pAllocationRequest->type = VmaAllocationRequestType::TLSF; - pAllocationRequest->allocHandle = (VmaAllocHandle)█ - pAllocationRequest->size = allocSize - GetDebugMargin(); - pAllocationRequest->customData = (void*)allocType; - pAllocationRequest->algorithmData = alignedOffset; - - // Place block at the start of list if it's normal block - if (listIndex != m_ListsCount && block.PrevFree()) - { - block.PrevFree()->NextFree() = block.NextFree(); - if (block.NextFree()) - block.NextFree()->PrevFree() = block.PrevFree(); - block.PrevFree() = VMA_NULL; - block.NextFree() = m_FreeList[listIndex]; - m_FreeList[listIndex] = █ - if (block.NextFree()) - block.NextFree()->PrevFree() = █ - } - - return true; -} -#endif // _VMA_BLOCK_METADATA_TLSF_FUNCTIONS -#endif // _VMA_BLOCK_METADATA_TLSF - -#ifndef _VMA_BLOCK_VECTOR -/* -Sequence of VmaDeviceMemoryBlock. Represents memory blocks allocated for a specific -Vulkan memory type. - -Synchronized internally with a mutex. 
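The TLSF lookup above (SizeToMemoryClass, SizeToSecondIndex, GetListIndex) maps an allocation size to a first-level "memory class" taken from the most significant bit and a second-level bucket taken from the next few bits below it. A minimal standalone sketch of that indexing scheme follows; the constants are illustrative assumptions and may differ from the header's actual values:

```
// Self-contained sketch of TLSF-style free-list indexing (constants assumed).
#include <cstdint>
#include <cstdio>

namespace tlsf_sketch {

constexpr uint8_t  SECOND_LEVEL_INDEX = 5;   // log2 of buckets per memory class (assumed)
constexpr uint8_t  MEMORY_CLASS_SHIFT = 7;   // sizes with MSB below this fall into class 0 (assumed)
constexpr uint64_t SMALL_BUFFER_SIZE  = 256; // linear bucketing below this size (assumed)

inline uint8_t BitScanMSB(uint64_t v)
{
    uint8_t pos = 0;
    while (v >>= 1) ++pos;
    return pos;
}

inline uint8_t SizeToMemoryClass(uint64_t size)
{
    return size > SMALL_BUFFER_SIZE ? uint8_t(BitScanMSB(size) - MEMORY_CLASS_SHIFT) : 0;
}

inline uint16_t SizeToSecondIndex(uint64_t size, uint8_t memoryClass)
{
    if (memoryClass == 0)
        return uint16_t((size - 1) / 64); // class 0: fixed-width buckets
    // Higher classes: take SECOND_LEVEL_INDEX bits just below the most significant bit.
    return uint16_t((size >> (memoryClass + MEMORY_CLASS_SHIFT - SECOND_LEVEL_INDEX))
                    ^ (1u << SECOND_LEVEL_INDEX));
}

inline uint32_t GetListIndex(uint64_t size)
{
    const uint8_t  cls = SizeToMemoryClass(size);
    const uint16_t sli = SizeToSecondIndex(size, cls);
    if (cls == 0)
        return sli;
    return uint32_t(cls - 1) * (1u << SECOND_LEVEL_INDEX) + sli + 4; // 4 class-0 lists (assumed)
}

} // namespace tlsf_sketch

int main()
{
    for (uint64_t size : { 32ull, 200ull, 300ull, 4096ull, 1ull << 20 })
        std::printf("size %llu -> free list %u\n",
                    (unsigned long long)size, tlsf_sketch::GetListIndex(size));
}
```

Because the mapping only inspects bit positions, finding a free block of at least a given size reduces to two bit scans over the first- and second-level bitmaps, which is what FindFreeBlock does above.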
-*/ -class VmaBlockVector -{ - friend struct VmaDefragmentationContext_T; - VMA_CLASS_NO_COPY(VmaBlockVector) -public: - VmaBlockVector( - VmaAllocator hAllocator, - VmaPool hParentPool, - uint32_t memoryTypeIndex, - VkDeviceSize preferredBlockSize, - size_t minBlockCount, - size_t maxBlockCount, - VkDeviceSize bufferImageGranularity, - bool explicitBlockSize, - uint32_t algorithm, - float priority, - VkDeviceSize minAllocationAlignment, - void* pMemoryAllocateNext); - ~VmaBlockVector(); - - VmaAllocator GetAllocator() const { return m_hAllocator; } - VmaPool GetParentPool() const { return m_hParentPool; } - bool IsCustomPool() const { return m_hParentPool != VMA_NULL; } - uint32_t GetMemoryTypeIndex() const { return m_MemoryTypeIndex; } - VkDeviceSize GetPreferredBlockSize() const { return m_PreferredBlockSize; } - VkDeviceSize GetBufferImageGranularity() const { return m_BufferImageGranularity; } - uint32_t GetAlgorithm() const { return m_Algorithm; } - bool HasExplicitBlockSize() const { return m_ExplicitBlockSize; } - float GetPriority() const { return m_Priority; } - const void* GetAllocationNextPtr() const { return m_pMemoryAllocateNext; } - // To be used only while the m_Mutex is locked. Used during defragmentation. - size_t GetBlockCount() const { return m_Blocks.size(); } - // To be used only while the m_Mutex is locked. Used during defragmentation. - VmaDeviceMemoryBlock* GetBlock(size_t index) const { return m_Blocks[index]; } - VMA_RW_MUTEX &GetMutex() { return m_Mutex; } - - VkResult CreateMinBlocks(); - void AddStatistics(VmaStatistics& inoutStats); - void AddDetailedStatistics(VmaDetailedStatistics& inoutStats); - bool IsEmpty(); - bool IsCorruptionDetectionEnabled() const; - - VkResult Allocate( - VkDeviceSize size, - VkDeviceSize alignment, - const VmaAllocationCreateInfo& createInfo, - VmaSuballocationType suballocType, - size_t allocationCount, - VmaAllocation* pAllocations); - - void Free(const VmaAllocation hAllocation); - -#if VMA_STATS_STRING_ENABLED - void PrintDetailedMap(class VmaJsonWriter& json); -#endif - - VkResult CheckCorruption(); - -private: - const VmaAllocator m_hAllocator; - const VmaPool m_hParentPool; - const uint32_t m_MemoryTypeIndex; - const VkDeviceSize m_PreferredBlockSize; - const size_t m_MinBlockCount; - const size_t m_MaxBlockCount; - const VkDeviceSize m_BufferImageGranularity; - const bool m_ExplicitBlockSize; - const uint32_t m_Algorithm; - const float m_Priority; - const VkDeviceSize m_MinAllocationAlignment; - - void* const m_pMemoryAllocateNext; - VMA_RW_MUTEX m_Mutex; - // Incrementally sorted by sumFreeSize, ascending. - VmaVector> m_Blocks; - uint32_t m_NextBlockId; - bool m_IncrementalSort = true; - - void SetIncrementalSort(bool val) { m_IncrementalSort = val; } - - VkDeviceSize CalcMaxBlockSize() const; - // Finds and removes given block from vector. - void Remove(VmaDeviceMemoryBlock* pBlock); - // Performs single step in sorting m_Blocks. They may not be fully sorted - // after this call. 
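IncrementallySortBlocks, declared just below, only has to nudge m_Blocks toward ascending sumFreeSize rather than fully sort it on every call. One hypothetical way to implement such a single step (not necessarily what this header does) is a lone bubble pass that performs at most one swap:

```
// Hedged sketch of "incremental" sorting: one cheap step per call, so the vector
// converges toward ascending free-size order over many allocations without a full sort.
#include <cstddef>
#include <utility>
#include <vector>

struct BlockSketch { size_t sumFreeSize; };

inline void IncrementallySortBlocksSketch(std::vector<BlockSketch*>& blocks)
{
    for (size_t i = 1; i < blocks.size(); ++i)
    {
        if (blocks[i - 1]->sumFreeSize > blocks[i]->sumFreeSize)
        {
            std::swap(blocks[i - 1], blocks[i]);
            return; // one swap per call keeps the order only roughly sorted, by design
        }
    }
}
```

Doing one cheap step per allocation amortizes the sorting cost while keeping the "prefer fuller blocks first" forward iteration in Allocate reasonably accurate.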
- void IncrementallySortBlocks(); - void SortByFreeSize(); - - VkResult AllocatePage( - VkDeviceSize size, - VkDeviceSize alignment, - const VmaAllocationCreateInfo& createInfo, - VmaSuballocationType suballocType, - VmaAllocation* pAllocation); - - VkResult AllocateFromBlock( - VmaDeviceMemoryBlock* pBlock, - VkDeviceSize size, - VkDeviceSize alignment, - VmaAllocationCreateFlags allocFlags, - void* pUserData, - VmaSuballocationType suballocType, - uint32_t strategy, - VmaAllocation* pAllocation); - - VkResult CommitAllocationRequest( - VmaAllocationRequest& allocRequest, - VmaDeviceMemoryBlock* pBlock, - VkDeviceSize alignment, - VmaAllocationCreateFlags allocFlags, - void* pUserData, - VmaSuballocationType suballocType, - VmaAllocation* pAllocation); - - VkResult CreateBlock(VkDeviceSize blockSize, size_t* pNewBlockIndex); - bool HasEmptyBlock(); -}; -#endif // _VMA_BLOCK_VECTOR - -#ifndef _VMA_DEFRAGMENTATION_CONTEXT -struct VmaDefragmentationContext_T -{ - VMA_CLASS_NO_COPY(VmaDefragmentationContext_T) -public: - VmaDefragmentationContext_T( - VmaAllocator hAllocator, - const VmaDefragmentationInfo& info); - ~VmaDefragmentationContext_T(); - - void GetStats(VmaDefragmentationStats& outStats) { outStats = m_GlobalStats; } - - VkResult DefragmentPassBegin(VmaDefragmentationPassMoveInfo& moveInfo); - VkResult DefragmentPassEnd(VmaDefragmentationPassMoveInfo& moveInfo); - -private: - // Max number of allocations to ignore due to size constraints before ending single pass - static const uint8_t MAX_ALLOCS_TO_IGNORE = 16; - enum class CounterStatus { Pass, Ignore, End }; - - struct FragmentedBlock - { - uint32_t data; - VmaDeviceMemoryBlock* block; - }; - struct StateBalanced - { - VkDeviceSize avgFreeSize = 0; - VkDeviceSize avgAllocSize = UINT64_MAX; - }; - struct StateExtensive - { - enum class Operation : uint8_t - { - FindFreeBlockBuffer, FindFreeBlockTexture, FindFreeBlockAll, - MoveBuffers, MoveTextures, MoveAll, - Cleanup, Done - }; - - Operation operation = Operation::FindFreeBlockTexture; - size_t firstFreeBlock = SIZE_MAX; - }; - struct MoveAllocationData - { - VkDeviceSize size; - VkDeviceSize alignment; - VmaSuballocationType type; - VmaAllocationCreateFlags flags; - VmaDefragmentationMove move = {}; - }; - - const VkDeviceSize m_MaxPassBytes; - const uint32_t m_MaxPassAllocations; - - VmaStlAllocator m_MoveAllocator; - VmaVector> m_Moves; - - uint8_t m_IgnoredAllocs = 0; - uint32_t m_Algorithm; - uint32_t m_BlockVectorCount; - VmaBlockVector* m_PoolBlockVector; - VmaBlockVector** m_pBlockVectors; - size_t m_ImmovableBlockCount = 0; - VmaDefragmentationStats m_GlobalStats = { 0 }; - VmaDefragmentationStats m_PassStats = { 0 }; - void* m_AlgorithmState = VMA_NULL; - - static MoveAllocationData GetMoveData(VmaAllocHandle handle, VmaBlockMetadata* metadata); - CounterStatus CheckCounters(VkDeviceSize bytes); - bool IncrementCounters(VkDeviceSize bytes); - bool ReallocWithinBlock(VmaBlockVector& vector, VmaDeviceMemoryBlock* block); - bool AllocInOtherBlock(size_t start, size_t end, MoveAllocationData& data, VmaBlockVector& vector); - - bool ComputeDefragmentation(VmaBlockVector& vector, size_t index); - bool ComputeDefragmentation_Fast(VmaBlockVector& vector); - bool ComputeDefragmentation_Balanced(VmaBlockVector& vector, size_t index, bool update); - bool ComputeDefragmentation_Full(VmaBlockVector& vector); - bool ComputeDefragmentation_Extensive(VmaBlockVector& vector, size_t index); - - void UpdateVectorStatistics(VmaBlockVector& vector, StateBalanced& state); - bool 
MoveDataToFreeBlocks(VmaSuballocationType currentType, - VmaBlockVector& vector, size_t firstFreeBlock, - bool& texturePresent, bool& bufferPresent, bool& otherPresent); -}; -#endif // _VMA_DEFRAGMENTATION_CONTEXT - -#ifndef _VMA_POOL_T -struct VmaPool_T -{ - friend struct VmaPoolListItemTraits; - VMA_CLASS_NO_COPY(VmaPool_T) -public: - VmaBlockVector m_BlockVector; - VmaDedicatedAllocationList m_DedicatedAllocations; - - VmaPool_T( - VmaAllocator hAllocator, - const VmaPoolCreateInfo& createInfo, - VkDeviceSize preferredBlockSize); - ~VmaPool_T(); - - uint32_t GetId() const { return m_Id; } - void SetId(uint32_t id) { VMA_ASSERT(m_Id == 0); m_Id = id; } - - const char* GetName() const { return m_Name; } - void SetName(const char* pName); - -#if VMA_STATS_STRING_ENABLED - //void PrintDetailedMap(class VmaStringBuilder& sb); -#endif - -private: - uint32_t m_Id; - char* m_Name; - VmaPool_T* m_PrevPool = VMA_NULL; - VmaPool_T* m_NextPool = VMA_NULL; -}; - -struct VmaPoolListItemTraits -{ - typedef VmaPool_T ItemType; - - static ItemType* GetPrev(const ItemType* item) { return item->m_PrevPool; } - static ItemType* GetNext(const ItemType* item) { return item->m_NextPool; } - static ItemType*& AccessPrev(ItemType* item) { return item->m_PrevPool; } - static ItemType*& AccessNext(ItemType* item) { return item->m_NextPool; } -}; -#endif // _VMA_POOL_T - -#ifndef _VMA_CURRENT_BUDGET_DATA -struct VmaCurrentBudgetData -{ - VMA_ATOMIC_UINT32 m_BlockCount[VK_MAX_MEMORY_HEAPS]; - VMA_ATOMIC_UINT32 m_AllocationCount[VK_MAX_MEMORY_HEAPS]; - VMA_ATOMIC_UINT64 m_BlockBytes[VK_MAX_MEMORY_HEAPS]; - VMA_ATOMIC_UINT64 m_AllocationBytes[VK_MAX_MEMORY_HEAPS]; - -#if VMA_MEMORY_BUDGET - VMA_ATOMIC_UINT32 m_OperationsSinceBudgetFetch; - VMA_RW_MUTEX m_BudgetMutex; - uint64_t m_VulkanUsage[VK_MAX_MEMORY_HEAPS]; - uint64_t m_VulkanBudget[VK_MAX_MEMORY_HEAPS]; - uint64_t m_BlockBytesAtBudgetFetch[VK_MAX_MEMORY_HEAPS]; -#endif // VMA_MEMORY_BUDGET - - VmaCurrentBudgetData(); - - void AddAllocation(uint32_t heapIndex, VkDeviceSize allocationSize); - void RemoveAllocation(uint32_t heapIndex, VkDeviceSize allocationSize); -}; - -#ifndef _VMA_CURRENT_BUDGET_DATA_FUNCTIONS -VmaCurrentBudgetData::VmaCurrentBudgetData() -{ - for (uint32_t heapIndex = 0; heapIndex < VK_MAX_MEMORY_HEAPS; ++heapIndex) - { - m_BlockCount[heapIndex] = 0; - m_AllocationCount[heapIndex] = 0; - m_BlockBytes[heapIndex] = 0; - m_AllocationBytes[heapIndex] = 0; -#if VMA_MEMORY_BUDGET - m_VulkanUsage[heapIndex] = 0; - m_VulkanBudget[heapIndex] = 0; - m_BlockBytesAtBudgetFetch[heapIndex] = 0; -#endif - } - -#if VMA_MEMORY_BUDGET - m_OperationsSinceBudgetFetch = 0; -#endif -} - -void VmaCurrentBudgetData::AddAllocation(uint32_t heapIndex, VkDeviceSize allocationSize) -{ - m_AllocationBytes[heapIndex] += allocationSize; - ++m_AllocationCount[heapIndex]; -#if VMA_MEMORY_BUDGET - ++m_OperationsSinceBudgetFetch; -#endif -} - -void VmaCurrentBudgetData::RemoveAllocation(uint32_t heapIndex, VkDeviceSize allocationSize) -{ - VMA_ASSERT(m_AllocationBytes[heapIndex] >= allocationSize); - m_AllocationBytes[heapIndex] -= allocationSize; - VMA_ASSERT(m_AllocationCount[heapIndex] > 0); - --m_AllocationCount[heapIndex]; -#if VMA_MEMORY_BUDGET - ++m_OperationsSinceBudgetFetch; -#endif -} -#endif // _VMA_CURRENT_BUDGET_DATA_FUNCTIONS -#endif // _VMA_CURRENT_BUDGET_DATA - -#ifndef _VMA_ALLOCATION_OBJECT_ALLOCATOR -/* -Thread-safe wrapper over VmaPoolAllocator free list, for allocation of VmaAllocation_T objects. 
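VmaCurrentBudgetData above keeps plain atomic counters per heap so that AddAllocation and RemoveAllocation can run without taking the budget mutex. A self-contained sketch of the same bookkeeping pattern, with std::atomic standing in for the VMA_ATOMIC_* wrappers and an assumed heap count:

```
// Per-heap budget bookkeeping sketch; names and kMaxHeaps are stand-ins.
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t kMaxHeaps = 16; // stand-in for VK_MAX_MEMORY_HEAPS

struct BudgetSketch {
    std::atomic<uint32_t> allocationCount[kMaxHeaps]{};
    std::atomic<uint64_t> allocationBytes[kMaxHeaps]{};

    void AddAllocation(uint32_t heapIndex, uint64_t size) {
        allocationBytes[heapIndex].fetch_add(size, std::memory_order_relaxed);
        allocationCount[heapIndex].fetch_add(1, std::memory_order_relaxed);
    }
    void RemoveAllocation(uint32_t heapIndex, uint64_t size) {
        assert(allocationBytes[heapIndex].load() >= size);
        allocationBytes[heapIndex].fetch_sub(size, std::memory_order_relaxed);
        assert(allocationCount[heapIndex].load() > 0);
        allocationCount[heapIndex].fetch_sub(1, std::memory_order_relaxed);
    }
};
```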
-*/ -class VmaAllocationObjectAllocator -{ - VMA_CLASS_NO_COPY(VmaAllocationObjectAllocator) -public: - VmaAllocationObjectAllocator(const VkAllocationCallbacks* pAllocationCallbacks) - : m_Allocator(pAllocationCallbacks, 1024) {} - - template VmaAllocation Allocate(Types&&... args); - void Free(VmaAllocation hAlloc); - -private: - VMA_MUTEX m_Mutex; - VmaPoolAllocator m_Allocator; -}; - -template -VmaAllocation VmaAllocationObjectAllocator::Allocate(Types&&... args) -{ - VmaMutexLock mutexLock(m_Mutex); - return m_Allocator.Alloc(std::forward(args)...); -} - -void VmaAllocationObjectAllocator::Free(VmaAllocation hAlloc) -{ - VmaMutexLock mutexLock(m_Mutex); - m_Allocator.Free(hAlloc); -} -#endif // _VMA_ALLOCATION_OBJECT_ALLOCATOR - -#ifndef _VMA_VIRTUAL_BLOCK_T -struct VmaVirtualBlock_T -{ - VMA_CLASS_NO_COPY(VmaVirtualBlock_T) -public: - const bool m_AllocationCallbacksSpecified; - const VkAllocationCallbacks m_AllocationCallbacks; - - VmaVirtualBlock_T(const VmaVirtualBlockCreateInfo& createInfo); - ~VmaVirtualBlock_T(); - - VkResult Init() { return VK_SUCCESS; } - bool IsEmpty() const { return m_Metadata->IsEmpty(); } - void Free(VmaVirtualAllocation allocation) { m_Metadata->Free((VmaAllocHandle)allocation); } - void SetAllocationUserData(VmaVirtualAllocation allocation, void* userData) { m_Metadata->SetAllocationUserData((VmaAllocHandle)allocation, userData); } - void Clear() { m_Metadata->Clear(); } - - const VkAllocationCallbacks* GetAllocationCallbacks() const; - void GetAllocationInfo(VmaVirtualAllocation allocation, VmaVirtualAllocationInfo& outInfo); - VkResult Allocate(const VmaVirtualAllocationCreateInfo& createInfo, VmaVirtualAllocation& outAllocation, - VkDeviceSize* outOffset); - void GetStatistics(VmaStatistics& outStats) const; - void CalculateDetailedStatistics(VmaDetailedStatistics& outStats) const; -#if VMA_STATS_STRING_ENABLED - void BuildStatsString(bool detailedMap, VmaStringBuilder& sb) const; -#endif - -private: - VmaBlockMetadata* m_Metadata; -}; - -#ifndef _VMA_VIRTUAL_BLOCK_T_FUNCTIONS -VmaVirtualBlock_T::VmaVirtualBlock_T(const VmaVirtualBlockCreateInfo& createInfo) - : m_AllocationCallbacksSpecified(createInfo.pAllocationCallbacks != VMA_NULL), - m_AllocationCallbacks(createInfo.pAllocationCallbacks != VMA_NULL ? *createInfo.pAllocationCallbacks : VmaEmptyAllocationCallbacks) -{ - const uint32_t algorithm = createInfo.flags & VMA_VIRTUAL_BLOCK_CREATE_ALGORITHM_MASK; - switch (algorithm) - { - default: - VMA_ASSERT(0); - case 0: - m_Metadata = vma_new(GetAllocationCallbacks(), VmaBlockMetadata_TLSF)(VK_NULL_HANDLE, 1, true); - break; - case VMA_VIRTUAL_BLOCK_CREATE_LINEAR_ALGORITHM_BIT: - m_Metadata = vma_new(GetAllocationCallbacks(), VmaBlockMetadata_Linear)(VK_NULL_HANDLE, 1, true); - break; - } - - m_Metadata->Init(createInfo.size); -} - -VmaVirtualBlock_T::~VmaVirtualBlock_T() -{ - // Define macro VMA_DEBUG_LOG to receive the list of the unfreed allocations - if (!m_Metadata->IsEmpty()) - m_Metadata->DebugLogAllAllocations(); - // This is the most important assert in the entire library. - // Hitting it means you have some memory leak - unreleased virtual allocations. - VMA_ASSERT(m_Metadata->IsEmpty() && "Some virtual allocations were not freed before destruction of this virtual block!"); - - vma_delete(GetAllocationCallbacks(), m_Metadata); -} - -const VkAllocationCallbacks* VmaVirtualBlock_T::GetAllocationCallbacks() const -{ - return m_AllocationCallbacksSpecified ? 
&m_AllocationCallbacks : VMA_NULL; -} - -void VmaVirtualBlock_T::GetAllocationInfo(VmaVirtualAllocation allocation, VmaVirtualAllocationInfo& outInfo) -{ - m_Metadata->GetAllocationInfo((VmaAllocHandle)allocation, outInfo); -} - -VkResult VmaVirtualBlock_T::Allocate(const VmaVirtualAllocationCreateInfo& createInfo, VmaVirtualAllocation& outAllocation, - VkDeviceSize* outOffset) -{ - VmaAllocationRequest request = {}; - if (m_Metadata->CreateAllocationRequest( - createInfo.size, // allocSize - VMA_MAX(createInfo.alignment, (VkDeviceSize)1), // allocAlignment - (createInfo.flags & VMA_VIRTUAL_ALLOCATION_CREATE_UPPER_ADDRESS_BIT) != 0, // upperAddress - VMA_SUBALLOCATION_TYPE_UNKNOWN, // allocType - unimportant - createInfo.flags & VMA_VIRTUAL_ALLOCATION_CREATE_STRATEGY_MASK, // strategy - &request)) - { - m_Metadata->Alloc(request, - VMA_SUBALLOCATION_TYPE_UNKNOWN, // type - unimportant - createInfo.pUserData); - outAllocation = (VmaVirtualAllocation)request.allocHandle; - if(outOffset) - *outOffset = m_Metadata->GetAllocationOffset(request.allocHandle); - return VK_SUCCESS; - } - outAllocation = (VmaVirtualAllocation)VK_NULL_HANDLE; - if (outOffset) - *outOffset = UINT64_MAX; - return VK_ERROR_OUT_OF_DEVICE_MEMORY; -} - -void VmaVirtualBlock_T::GetStatistics(VmaStatistics& outStats) const -{ - VmaClearStatistics(outStats); - m_Metadata->AddStatistics(outStats); -} - -void VmaVirtualBlock_T::CalculateDetailedStatistics(VmaDetailedStatistics& outStats) const -{ - VmaClearDetailedStatistics(outStats); - m_Metadata->AddDetailedStatistics(outStats); -} - -#if VMA_STATS_STRING_ENABLED -void VmaVirtualBlock_T::BuildStatsString(bool detailedMap, VmaStringBuilder& sb) const -{ - VmaJsonWriter json(GetAllocationCallbacks(), sb); - json.BeginObject(); - - VmaDetailedStatistics stats; - CalculateDetailedStatistics(stats); - - json.WriteString("Stats"); - VmaPrintDetailedStatistics(json, stats); - - if (detailedMap) - { - json.WriteString("Details"); - json.BeginObject(); - m_Metadata->PrintDetailedMap(json); - json.EndObject(); - } - - json.EndObject(); -} -#endif // VMA_STATS_STRING_ENABLED -#endif // _VMA_VIRTUAL_BLOCK_T_FUNCTIONS -#endif // _VMA_VIRTUAL_BLOCK_T - - -// Main allocator object. -struct VmaAllocator_T -{ - VMA_CLASS_NO_COPY(VmaAllocator_T) -public: - bool m_UseMutex; - uint32_t m_VulkanApiVersion; - bool m_UseKhrDedicatedAllocation; // Can be set only if m_VulkanApiVersion < VK_MAKE_VERSION(1, 1, 0). - bool m_UseKhrBindMemory2; // Can be set only if m_VulkanApiVersion < VK_MAKE_VERSION(1, 1, 0). - bool m_UseExtMemoryBudget; - bool m_UseAmdDeviceCoherentMemory; - bool m_UseKhrBufferDeviceAddress; - bool m_UseExtMemoryPriority; - VkDevice m_hDevice; - VkInstance m_hInstance; - bool m_AllocationCallbacksSpecified; - VkAllocationCallbacks m_AllocationCallbacks; - VmaDeviceMemoryCallbacks m_DeviceMemoryCallbacks; - VmaAllocationObjectAllocator m_AllocationObjectAllocator; - - // Each bit (1 << i) is set if HeapSizeLimit is enabled for that heap, so cannot allocate more than the heap size. - uint32_t m_HeapSizeLimitMask; - - VkPhysicalDeviceProperties m_PhysicalDeviceProperties; - VkPhysicalDeviceMemoryProperties m_MemProps; - - // Default pools. - VmaBlockVector* m_pBlockVectors[VK_MAX_MEMORY_TYPES]; - VmaDedicatedAllocationList m_DedicatedAllocations[VK_MAX_MEMORY_TYPES]; - - VmaCurrentBudgetData m_Budget; - VMA_ATOMIC_UINT32 m_DeviceMemoryCount; // Total number of VkDeviceMemory objects. 
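VmaVirtualBlock_T above is the backing object for VMA's virtual allocation feature: it runs the TLSF or linear metadata over a caller-defined address range without touching any VkDeviceMemory. A hedged usage sketch against the public API as it looks in VMA 3.x (verify the signatures against the header revision you actually build with):

```
// Usage sketch of the virtual-allocation API backed by VmaVirtualBlock_T.
#include "vk_mem_alloc.h"
#include <cstdio>

void VirtualBlockDemo()
{
    VmaVirtualBlockCreateInfo blockInfo = {};
    blockInfo.size = 1024 * 1024; // 1 MiB of "virtual" space, no real GPU memory

    VmaVirtualBlock block = VK_NULL_HANDLE;
    if (vmaCreateVirtualBlock(&blockInfo, &block) != VK_SUCCESS)
        return;

    VmaVirtualAllocationCreateInfo allocInfo = {};
    allocInfo.size = 4096;
    allocInfo.alignment = 256;

    VmaVirtualAllocation alloc = VK_NULL_HANDLE;
    VkDeviceSize offset = 0;
    if (vmaVirtualAllocate(block, &allocInfo, &alloc, &offset) == VK_SUCCESS)
    {
        std::printf("sub-allocated 4 KiB at offset %llu\n", (unsigned long long)offset);
        vmaVirtualFree(block, alloc);
    }
    vmaDestroyVirtualBlock(block);
}
```

Because no real memory is involved, the returned offset is typically used to sub-allocate a caller-owned buffer, descriptor range, or similar resource.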
- - VmaAllocator_T(const VmaAllocatorCreateInfo* pCreateInfo); - VkResult Init(const VmaAllocatorCreateInfo* pCreateInfo); - ~VmaAllocator_T(); - - const VkAllocationCallbacks* GetAllocationCallbacks() const - { - return m_AllocationCallbacksSpecified ? &m_AllocationCallbacks : VMA_NULL; - } - const VmaVulkanFunctions& GetVulkanFunctions() const - { - return m_VulkanFunctions; - } - - VkPhysicalDevice GetPhysicalDevice() const { return m_PhysicalDevice; } - - VkDeviceSize GetBufferImageGranularity() const - { - return VMA_MAX( - static_cast(VMA_DEBUG_MIN_BUFFER_IMAGE_GRANULARITY), - m_PhysicalDeviceProperties.limits.bufferImageGranularity); - } - - uint32_t GetMemoryHeapCount() const { return m_MemProps.memoryHeapCount; } - uint32_t GetMemoryTypeCount() const { return m_MemProps.memoryTypeCount; } - - uint32_t MemoryTypeIndexToHeapIndex(uint32_t memTypeIndex) const - { - VMA_ASSERT(memTypeIndex < m_MemProps.memoryTypeCount); - return m_MemProps.memoryTypes[memTypeIndex].heapIndex; - } - // True when specific memory type is HOST_VISIBLE but not HOST_COHERENT. - bool IsMemoryTypeNonCoherent(uint32_t memTypeIndex) const - { - return (m_MemProps.memoryTypes[memTypeIndex].propertyFlags & (VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT)) == - VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT; - } - // Minimum alignment for all allocations in specific memory type. - VkDeviceSize GetMemoryTypeMinAlignment(uint32_t memTypeIndex) const - { - return IsMemoryTypeNonCoherent(memTypeIndex) ? - VMA_MAX((VkDeviceSize)VMA_MIN_ALIGNMENT, m_PhysicalDeviceProperties.limits.nonCoherentAtomSize) : - (VkDeviceSize)VMA_MIN_ALIGNMENT; - } - - bool IsIntegratedGpu() const - { - return m_PhysicalDeviceProperties.deviceType == VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU; - } - - uint32_t GetGlobalMemoryTypeBits() const { return m_GlobalMemoryTypeBits; } - - void GetBufferMemoryRequirements( - VkBuffer hBuffer, - VkMemoryRequirements& memReq, - bool& requiresDedicatedAllocation, - bool& prefersDedicatedAllocation) const; - void GetImageMemoryRequirements( - VkImage hImage, - VkMemoryRequirements& memReq, - bool& requiresDedicatedAllocation, - bool& prefersDedicatedAllocation) const; - VkResult FindMemoryTypeIndex( - uint32_t memoryTypeBits, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - VkFlags bufImgUsage, // VkBufferCreateInfo::usage or VkImageCreateInfo::usage. UINT32_MAX if unknown. - uint32_t* pMemoryTypeIndex) const; - - // Main allocation function. - VkResult AllocateMemory( - const VkMemoryRequirements& vkMemReq, - bool requiresDedicatedAllocation, - bool prefersDedicatedAllocation, - VkBuffer dedicatedBuffer, - VkImage dedicatedImage, - VkFlags dedicatedBufferImageUsage, // UINT32_MAX if unknown. - const VmaAllocationCreateInfo& createInfo, - VmaSuballocationType suballocType, - size_t allocationCount, - VmaAllocation* pAllocations); - - // Main deallocation function. 
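FindMemoryTypeIndex, declared above, narrows VkPhysicalDeviceMemoryProperties down to one memory type compatible with both the resource's memoryTypeBits and the requested usage. A stripped-down sketch of the core filter (the real function additionally weighs preferred flags, budget, and buffer/image usage):

```
// Pick the first memory type allowed by memoryTypeBits whose propertyFlags
// contain all required flags; return UINT32_MAX if none qualifies.
#include <vulkan/vulkan.h>
#include <cstdint>

uint32_t FindMemoryTypeSketch(
    const VkPhysicalDeviceMemoryProperties& memProps,
    uint32_t memoryTypeBits,
    VkMemoryPropertyFlags requiredFlags)
{
    for (uint32_t i = 0; i < memProps.memoryTypeCount; ++i)
    {
        const bool allowed  = (memoryTypeBits & (1u << i)) != 0;
        const bool hasFlags =
            (memProps.memoryTypes[i].propertyFlags & requiredFlags) == requiredFlags;
        if (allowed && hasFlags)
            return i;
    }
    return UINT32_MAX; // no suitable type
}
```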
- void FreeMemory( - size_t allocationCount, - const VmaAllocation* pAllocations); - - void CalculateStatistics(VmaTotalStatistics* pStats); - - void GetHeapBudgets( - VmaBudget* outBudgets, uint32_t firstHeap, uint32_t heapCount); - -#if VMA_STATS_STRING_ENABLED - void PrintDetailedMap(class VmaJsonWriter& json); -#endif - - void GetAllocationInfo(VmaAllocation hAllocation, VmaAllocationInfo* pAllocationInfo); - - VkResult CreatePool(const VmaPoolCreateInfo* pCreateInfo, VmaPool* pPool); - void DestroyPool(VmaPool pool); - void GetPoolStatistics(VmaPool pool, VmaStatistics* pPoolStats); - void CalculatePoolStatistics(VmaPool pool, VmaDetailedStatistics* pPoolStats); - - void SetCurrentFrameIndex(uint32_t frameIndex); - uint32_t GetCurrentFrameIndex() const { return m_CurrentFrameIndex.load(); } - - VkResult CheckPoolCorruption(VmaPool hPool); - VkResult CheckCorruption(uint32_t memoryTypeBits); - - // Call to Vulkan function vkAllocateMemory with accompanying bookkeeping. - VkResult AllocateVulkanMemory(const VkMemoryAllocateInfo* pAllocateInfo, VkDeviceMemory* pMemory); - // Call to Vulkan function vkFreeMemory with accompanying bookkeeping. - void FreeVulkanMemory(uint32_t memoryType, VkDeviceSize size, VkDeviceMemory hMemory); - // Call to Vulkan function vkBindBufferMemory or vkBindBufferMemory2KHR. - VkResult BindVulkanBuffer( - VkDeviceMemory memory, - VkDeviceSize memoryOffset, - VkBuffer buffer, - const void* pNext); - // Call to Vulkan function vkBindImageMemory or vkBindImageMemory2KHR. - VkResult BindVulkanImage( - VkDeviceMemory memory, - VkDeviceSize memoryOffset, - VkImage image, - const void* pNext); - - VkResult Map(VmaAllocation hAllocation, void** ppData); - void Unmap(VmaAllocation hAllocation); - - VkResult BindBufferMemory( - VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkBuffer hBuffer, - const void* pNext); - VkResult BindImageMemory( - VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkImage hImage, - const void* pNext); - - VkResult FlushOrInvalidateAllocation( - VmaAllocation hAllocation, - VkDeviceSize offset, VkDeviceSize size, - VMA_CACHE_OPERATION op); - VkResult FlushOrInvalidateAllocations( - uint32_t allocationCount, - const VmaAllocation* allocations, - const VkDeviceSize* offsets, const VkDeviceSize* sizes, - VMA_CACHE_OPERATION op); - - void FillAllocation(const VmaAllocation hAllocation, uint8_t pattern); - - /* - Returns bit mask of memory types that can support defragmentation on GPU as - they support creation of required buffer for copy operations. - */ - uint32_t GetGpuDefragmentationMemoryTypeBits(); - -#if VMA_EXTERNAL_MEMORY - VkExternalMemoryHandleTypeFlagsKHR GetExternalMemoryHandleTypeFlags(uint32_t memTypeIndex) const - { - return m_TypeExternalMemoryHandleTypes[memTypeIndex]; - } -#endif // #if VMA_EXTERNAL_MEMORY - -private: - VkDeviceSize m_PreferredLargeHeapBlockSize; - - VkPhysicalDevice m_PhysicalDevice; - VMA_ATOMIC_UINT32 m_CurrentFrameIndex; - VMA_ATOMIC_UINT32 m_GpuDefragmentationMemoryTypeBits; // UINT32_MAX means uninitialized. -#if VMA_EXTERNAL_MEMORY - VkExternalMemoryHandleTypeFlagsKHR m_TypeExternalMemoryHandleTypes[VK_MAX_MEMORY_TYPES]; -#endif // #if VMA_EXTERNAL_MEMORY - - VMA_RW_MUTEX m_PoolsMutex; - typedef VmaIntrusiveLinkedList PoolList; - // Protected by m_PoolsMutex. - PoolList m_Pools; - uint32_t m_NextPoolId; - - VmaVulkanFunctions m_VulkanFunctions; - - // Global bit mask AND-ed with any memoryTypeBits to disallow certain memory types. 
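FlushOrInvalidateAllocation above ultimately has to hand Vulkan a VkMappedMemoryRange whose offset and size respect nonCoherentAtomSize for HOST_VISIBLE but non-COHERENT memory types. A sketch of that rounding with names of our own; the real GetFlushOrInvalidateRange also clamps against the VkDeviceMemory size and can fall back to VK_WHOLE_SIZE:

```
// Round a sub-range of a mapped allocation to nonCoherentAtomSize boundaries,
// as required by vkFlushMappedMemoryRanges / vkInvalidateMappedMemoryRanges.
#include <vulkan/vulkan.h>

VkMappedMemoryRange MakeNonCoherentRangeSketch(
    VkDeviceMemory memory,
    VkDeviceSize allocOffset,   // allocation offset inside the VkDeviceMemory
    VkDeviceSize offset,        // caller's offset inside the allocation
    VkDeviceSize size,          // caller's size
    VkDeviceSize atom)          // VkPhysicalDeviceLimits::nonCoherentAtomSize
{
    VkMappedMemoryRange range = { VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE };
    range.memory = memory;

    const VkDeviceSize begin = allocOffset + offset;
    range.offset = begin - begin % atom;                           // round down to atom
    const VkDeviceSize end = begin + size;
    range.size = ((end - range.offset) + atom - 1) / atom * atom;  // round up to atom
    return range;
}
```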
- uint32_t m_GlobalMemoryTypeBits; - - void ImportVulkanFunctions(const VmaVulkanFunctions* pVulkanFunctions); - -#if VMA_STATIC_VULKAN_FUNCTIONS == 1 - void ImportVulkanFunctions_Static(); -#endif - - void ImportVulkanFunctions_Custom(const VmaVulkanFunctions* pVulkanFunctions); - -#if VMA_DYNAMIC_VULKAN_FUNCTIONS == 1 - void ImportVulkanFunctions_Dynamic(); -#endif - - void ValidateVulkanFunctions(); - - VkDeviceSize CalcPreferredBlockSize(uint32_t memTypeIndex); - - VkResult AllocateMemoryOfType( - VmaPool pool, - VkDeviceSize size, - VkDeviceSize alignment, - bool dedicatedPreferred, - VkBuffer dedicatedBuffer, - VkImage dedicatedImage, - VkFlags dedicatedBufferImageUsage, - const VmaAllocationCreateInfo& createInfo, - uint32_t memTypeIndex, - VmaSuballocationType suballocType, - VmaDedicatedAllocationList& dedicatedAllocations, - VmaBlockVector& blockVector, - size_t allocationCount, - VmaAllocation* pAllocations); - - // Helper function only to be used inside AllocateDedicatedMemory. - VkResult AllocateDedicatedMemoryPage( - VmaPool pool, - VkDeviceSize size, - VmaSuballocationType suballocType, - uint32_t memTypeIndex, - const VkMemoryAllocateInfo& allocInfo, - bool map, - bool isUserDataString, - bool isMappingAllowed, - void* pUserData, - VmaAllocation* pAllocation); - - // Allocates and registers new VkDeviceMemory specifically for dedicated allocations. - VkResult AllocateDedicatedMemory( - VmaPool pool, - VkDeviceSize size, - VmaSuballocationType suballocType, - VmaDedicatedAllocationList& dedicatedAllocations, - uint32_t memTypeIndex, - bool map, - bool isUserDataString, - bool isMappingAllowed, - bool canAliasMemory, - void* pUserData, - float priority, - VkBuffer dedicatedBuffer, - VkImage dedicatedImage, - VkFlags dedicatedBufferImageUsage, - size_t allocationCount, - VmaAllocation* pAllocations, - const void* pNextChain = nullptr); - - void FreeDedicatedMemory(const VmaAllocation allocation); - - VkResult CalcMemTypeParams( - VmaAllocationCreateInfo& outCreateInfo, - uint32_t memTypeIndex, - VkDeviceSize size, - size_t allocationCount); - VkResult CalcAllocationParams( - VmaAllocationCreateInfo& outCreateInfo, - bool dedicatedRequired, - bool dedicatedPreferred); - - /* - Calculates and returns bit mask of memory types that can support defragmentation - on GPU as they support creation of required buffer for copy operations. 
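The comment above describes deriving a memory-type mask from the ability to create the buffer needed for copy operations. The usual way to obtain such a mask in Vulkan, and a reasonable reading of what CalculateGpuDefragmentationMemoryTypeBits does, is to create a throwaway transfer buffer and read memoryTypeBits back from its requirements; a hedged sketch:

```
// Query which memory types can back a TRANSFER_SRC|TRANSFER_DST buffer.
#include <vulkan/vulkan.h>
#include <cstdint>

uint32_t QueryCopyCapableMemoryTypesSketch(VkDevice device)
{
    VkBufferCreateInfo bufInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
    bufInfo.size = 0x10000; // arbitrary small size; only the requirements matter
    bufInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;
    bufInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

    VkBuffer buf = VK_NULL_HANDLE;
    if (vkCreateBuffer(device, &bufInfo, nullptr, &buf) != VK_SUCCESS)
        return 0;

    VkMemoryRequirements memReq = {};
    vkGetBufferMemoryRequirements(device, buf, &memReq);
    vkDestroyBuffer(device, buf, nullptr);
    return memReq.memoryTypeBits;
}
```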
- */ - uint32_t CalculateGpuDefragmentationMemoryTypeBits() const; - uint32_t CalculateGlobalMemoryTypeBits() const; - - bool GetFlushOrInvalidateRange( - VmaAllocation allocation, - VkDeviceSize offset, VkDeviceSize size, - VkMappedMemoryRange& outRange) const; - -#if VMA_MEMORY_BUDGET - void UpdateVulkanBudget(); -#endif // #if VMA_MEMORY_BUDGET -}; - - -#ifndef _VMA_MEMORY_FUNCTIONS -static void* VmaMalloc(VmaAllocator hAllocator, size_t size, size_t alignment) -{ - return VmaMalloc(&hAllocator->m_AllocationCallbacks, size, alignment); -} - -static void VmaFree(VmaAllocator hAllocator, void* ptr) -{ - VmaFree(&hAllocator->m_AllocationCallbacks, ptr); -} - -template -static T* VmaAllocate(VmaAllocator hAllocator) -{ - return (T*)VmaMalloc(hAllocator, sizeof(T), VMA_ALIGN_OF(T)); -} - -template -static T* VmaAllocateArray(VmaAllocator hAllocator, size_t count) -{ - return (T*)VmaMalloc(hAllocator, sizeof(T) * count, VMA_ALIGN_OF(T)); -} - -template -static void vma_delete(VmaAllocator hAllocator, T* ptr) -{ - if(ptr != VMA_NULL) - { - ptr->~T(); - VmaFree(hAllocator, ptr); - } -} - -template -static void vma_delete_array(VmaAllocator hAllocator, T* ptr, size_t count) -{ - if(ptr != VMA_NULL) - { - for(size_t i = count; i--; ) - ptr[i].~T(); - VmaFree(hAllocator, ptr); - } -} -#endif // _VMA_MEMORY_FUNCTIONS - -#ifndef _VMA_DEVICE_MEMORY_BLOCK_FUNCTIONS -VmaDeviceMemoryBlock::VmaDeviceMemoryBlock(VmaAllocator hAllocator) - : m_pMetadata(VMA_NULL), - m_MemoryTypeIndex(UINT32_MAX), - m_Id(0), - m_hMemory(VK_NULL_HANDLE), - m_MapCount(0), - m_pMappedData(VMA_NULL) {} - -VmaDeviceMemoryBlock::~VmaDeviceMemoryBlock() -{ - VMA_ASSERT(m_MapCount == 0 && "VkDeviceMemory block is being destroyed while it is still mapped."); - VMA_ASSERT(m_hMemory == VK_NULL_HANDLE); -} - -void VmaDeviceMemoryBlock::Init( - VmaAllocator hAllocator, - VmaPool hParentPool, - uint32_t newMemoryTypeIndex, - VkDeviceMemory newMemory, - VkDeviceSize newSize, - uint32_t id, - uint32_t algorithm, - VkDeviceSize bufferImageGranularity) -{ - VMA_ASSERT(m_hMemory == VK_NULL_HANDLE); - - m_hParentPool = hParentPool; - m_MemoryTypeIndex = newMemoryTypeIndex; - m_Id = id; - m_hMemory = newMemory; - - switch (algorithm) - { - case VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT: - m_pMetadata = vma_new(hAllocator, VmaBlockMetadata_Linear)(hAllocator->GetAllocationCallbacks(), - bufferImageGranularity, false); // isVirtual - break; - default: - VMA_ASSERT(0); - // Fall-through. - case 0: - m_pMetadata = vma_new(hAllocator, VmaBlockMetadata_TLSF)(hAllocator->GetAllocationCallbacks(), - bufferImageGranularity, false); // isVirtual - } - m_pMetadata->Init(newSize); -} - -void VmaDeviceMemoryBlock::Destroy(VmaAllocator allocator) -{ - // Define macro VMA_DEBUG_LOG to receive the list of the unfreed allocations - if (!m_pMetadata->IsEmpty()) - m_pMetadata->DebugLogAllAllocations(); - // This is the most important assert in the entire library. - // Hitting it means you have some memory leak - unreleased VmaAllocation objects. 
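The memory helpers above (VmaAllocate, vma_delete, vma_delete_array) separate raw allocation through the VkAllocationCallbacks from object lifetime: construction is a placement-new, destruction is an explicit destructor call followed by a free. A self-contained sketch of that pattern over malloc/free, with names that are ours rather than VMA's:

```
// Allocate-construct / destroy-free pattern, simplified to malloc/free.
#include <cstdlib>
#include <new>
#include <utility>

template<typename T, typename... Args>
T* SketchNew(Args&&... args)
{
    void* mem = std::malloc(sizeof(T));
    if (!mem) throw std::bad_alloc();
    return new (mem) T(std::forward<Args>(args)...); // placement-new, like vma_new
}

template<typename T>
void SketchDelete(T* ptr)
{
    if (ptr)
    {
        ptr->~T();     // explicit destructor call, like vma_delete
        std::free(ptr);
    }
}
```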
- VMA_ASSERT(m_pMetadata->IsEmpty() && "Some allocations were not freed before destruction of this memory block!"); - - VMA_ASSERT(m_hMemory != VK_NULL_HANDLE); - allocator->FreeVulkanMemory(m_MemoryTypeIndex, m_pMetadata->GetSize(), m_hMemory); - m_hMemory = VK_NULL_HANDLE; - - vma_delete(allocator, m_pMetadata); - m_pMetadata = VMA_NULL; -} - -void VmaDeviceMemoryBlock::PostFree(VmaAllocator hAllocator) -{ - if(m_MappingHysteresis.PostFree()) - { - VMA_ASSERT(m_MappingHysteresis.GetExtraMapping() == 0); - if (m_MapCount == 0) - { - m_pMappedData = VMA_NULL; - (*hAllocator->GetVulkanFunctions().vkUnmapMemory)(hAllocator->m_hDevice, m_hMemory); - } - } -} - -bool VmaDeviceMemoryBlock::Validate() const -{ - VMA_VALIDATE((m_hMemory != VK_NULL_HANDLE) && - (m_pMetadata->GetSize() != 0)); - - return m_pMetadata->Validate(); -} - -VkResult VmaDeviceMemoryBlock::CheckCorruption(VmaAllocator hAllocator) -{ - void* pData = nullptr; - VkResult res = Map(hAllocator, 1, &pData); - if (res != VK_SUCCESS) - { - return res; - } - - res = m_pMetadata->CheckCorruption(pData); - - Unmap(hAllocator, 1); - - return res; -} - -VkResult VmaDeviceMemoryBlock::Map(VmaAllocator hAllocator, uint32_t count, void** ppData) -{ - if (count == 0) - { - return VK_SUCCESS; - } - - VmaMutexLock lock(m_MapAndBindMutex, hAllocator->m_UseMutex); - const uint32_t oldTotalMapCount = m_MapCount + m_MappingHysteresis.GetExtraMapping(); - m_MappingHysteresis.PostMap(); - if (oldTotalMapCount != 0) - { - m_MapCount += count; - VMA_ASSERT(m_pMappedData != VMA_NULL); - if (ppData != VMA_NULL) - { - *ppData = m_pMappedData; - } - return VK_SUCCESS; - } - else - { - VkResult result = (*hAllocator->GetVulkanFunctions().vkMapMemory)( - hAllocator->m_hDevice, - m_hMemory, - 0, // offset - VK_WHOLE_SIZE, - 0, // flags - &m_pMappedData); - if (result == VK_SUCCESS) - { - if (ppData != VMA_NULL) - { - *ppData = m_pMappedData; - } - m_MapCount = count; - } - return result; - } -} - -void VmaDeviceMemoryBlock::Unmap(VmaAllocator hAllocator, uint32_t count) -{ - if (count == 0) - { - return; - } - - VmaMutexLock lock(m_MapAndBindMutex, hAllocator->m_UseMutex); - if (m_MapCount >= count) - { - m_MapCount -= count; - const uint32_t totalMapCount = m_MapCount + m_MappingHysteresis.GetExtraMapping(); - if (totalMapCount == 0) - { - m_pMappedData = VMA_NULL; - (*hAllocator->GetVulkanFunctions().vkUnmapMemory)(hAllocator->m_hDevice, m_hMemory); - } - m_MappingHysteresis.PostUnmap(); - } - else - { - VMA_ASSERT(0 && "VkDeviceMemory block is being unmapped while it was not previously mapped."); - } -} - -VkResult VmaDeviceMemoryBlock::WriteMagicValueAfterAllocation(VmaAllocator hAllocator, VkDeviceSize allocOffset, VkDeviceSize allocSize) -{ - VMA_ASSERT(VMA_DEBUG_MARGIN > 0 && VMA_DEBUG_MARGIN % 4 == 0 && VMA_DEBUG_DETECT_CORRUPTION); - - void* pData; - VkResult res = Map(hAllocator, 1, &pData); - if (res != VK_SUCCESS) - { - return res; - } - - VmaWriteMagicValue(pData, allocOffset + allocSize); - - Unmap(hAllocator, 1); - return VK_SUCCESS; -} - -VkResult VmaDeviceMemoryBlock::ValidateMagicValueAfterAllocation(VmaAllocator hAllocator, VkDeviceSize allocOffset, VkDeviceSize allocSize) -{ - VMA_ASSERT(VMA_DEBUG_MARGIN > 0 && VMA_DEBUG_MARGIN % 4 == 0 && VMA_DEBUG_DETECT_CORRUPTION); - - void* pData; - VkResult res = Map(hAllocator, 1, &pData); - if (res != VK_SUCCESS) - { - return res; - } - - if (!VmaValidateMagicValue(pData, allocOffset + allocSize)) - { - VMA_ASSERT(0 && "MEMORY CORRUPTION DETECTED AFTER FREED ALLOCATION!"); - } - - 
Unmap(hAllocator, 1); - return VK_SUCCESS; -} - -VkResult VmaDeviceMemoryBlock::BindBufferMemory( - const VmaAllocator hAllocator, - const VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkBuffer hBuffer, - const void* pNext) -{ - VMA_ASSERT(hAllocation->GetType() == VmaAllocation_T::ALLOCATION_TYPE_BLOCK && - hAllocation->GetBlock() == this); - VMA_ASSERT(allocationLocalOffset < hAllocation->GetSize() && - "Invalid allocationLocalOffset. Did you forget that this offset is relative to the beginning of the allocation, not the whole memory block?"); - const VkDeviceSize memoryOffset = hAllocation->GetOffset() + allocationLocalOffset; - // This lock is important so that we don't call vkBind... and/or vkMap... simultaneously on the same VkDeviceMemory from multiple threads. - VmaMutexLock lock(m_MapAndBindMutex, hAllocator->m_UseMutex); - return hAllocator->BindVulkanBuffer(m_hMemory, memoryOffset, hBuffer, pNext); -} - -VkResult VmaDeviceMemoryBlock::BindImageMemory( - const VmaAllocator hAllocator, - const VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkImage hImage, - const void* pNext) -{ - VMA_ASSERT(hAllocation->GetType() == VmaAllocation_T::ALLOCATION_TYPE_BLOCK && - hAllocation->GetBlock() == this); - VMA_ASSERT(allocationLocalOffset < hAllocation->GetSize() && - "Invalid allocationLocalOffset. Did you forget that this offset is relative to the beginning of the allocation, not the whole memory block?"); - const VkDeviceSize memoryOffset = hAllocation->GetOffset() + allocationLocalOffset; - // This lock is important so that we don't call vkBind... and/or vkMap... simultaneously on the same VkDeviceMemory from multiple threads. - VmaMutexLock lock(m_MapAndBindMutex, hAllocator->m_UseMutex); - return hAllocator->BindVulkanImage(m_hMemory, memoryOffset, hImage, pNext); -} -#endif // _VMA_DEVICE_MEMORY_BLOCK_FUNCTIONS - -#ifndef _VMA_ALLOCATION_T_FUNCTIONS -VmaAllocation_T::VmaAllocation_T(bool mappingAllowed) - : m_Alignment{ 1 }, - m_Size{ 0 }, - m_pUserData{ VMA_NULL }, - m_pName{ VMA_NULL }, - m_MemoryTypeIndex{ 0 }, - m_Type{ (uint8_t)ALLOCATION_TYPE_NONE }, - m_SuballocationType{ (uint8_t)VMA_SUBALLOCATION_TYPE_UNKNOWN }, - m_MapCount{ 0 }, - m_Flags{ 0 } -{ - if(mappingAllowed) - m_Flags |= (uint8_t)FLAG_MAPPING_ALLOWED; - -#if VMA_STATS_STRING_ENABLED - m_BufferImageUsage = 0; -#endif -} - -VmaAllocation_T::~VmaAllocation_T() -{ - VMA_ASSERT(m_MapCount == 0 && "Allocation was not unmapped before destruction."); - - // Check if owned string was freed. - VMA_ASSERT(m_pName == VMA_NULL); -} - -void VmaAllocation_T::InitBlockAllocation( - VmaDeviceMemoryBlock* block, - VmaAllocHandle allocHandle, - VkDeviceSize alignment, - VkDeviceSize size, - uint32_t memoryTypeIndex, - VmaSuballocationType suballocationType, - bool mapped) -{ - VMA_ASSERT(m_Type == ALLOCATION_TYPE_NONE); - VMA_ASSERT(block != VMA_NULL); - m_Type = (uint8_t)ALLOCATION_TYPE_BLOCK; - m_Alignment = alignment; - m_Size = size; - m_MemoryTypeIndex = memoryTypeIndex; - if(mapped) - { - VMA_ASSERT(IsMappingAllowed() && "Mapping is not allowed on this allocation! 
Please use one of the new VMA_ALLOCATION_CREATE_HOST_ACCESS_* flags when creating it."); - m_Flags |= (uint8_t)FLAG_PERSISTENT_MAP; - } - m_SuballocationType = (uint8_t)suballocationType; - m_BlockAllocation.m_Block = block; - m_BlockAllocation.m_AllocHandle = allocHandle; -} - -void VmaAllocation_T::InitDedicatedAllocation( - VmaPool hParentPool, - uint32_t memoryTypeIndex, - VkDeviceMemory hMemory, - VmaSuballocationType suballocationType, - void* pMappedData, - VkDeviceSize size) -{ - VMA_ASSERT(m_Type == ALLOCATION_TYPE_NONE); - VMA_ASSERT(hMemory != VK_NULL_HANDLE); - m_Type = (uint8_t)ALLOCATION_TYPE_DEDICATED; - m_Alignment = 0; - m_Size = size; - m_MemoryTypeIndex = memoryTypeIndex; - m_SuballocationType = (uint8_t)suballocationType; - if(pMappedData != VMA_NULL) - { - VMA_ASSERT(IsMappingAllowed() && "Mapping is not allowed on this allocation! Please use one of the new VMA_ALLOCATION_CREATE_HOST_ACCESS_* flags when creating it."); - m_Flags |= (uint8_t)FLAG_PERSISTENT_MAP; - } - m_DedicatedAllocation.m_hParentPool = hParentPool; - m_DedicatedAllocation.m_hMemory = hMemory; - m_DedicatedAllocation.m_pMappedData = pMappedData; - m_DedicatedAllocation.m_Prev = VMA_NULL; - m_DedicatedAllocation.m_Next = VMA_NULL; -} - -void VmaAllocation_T::SetName(VmaAllocator hAllocator, const char* pName) -{ - VMA_ASSERT(pName == VMA_NULL || pName != m_pName); - - FreeName(hAllocator); - - if (pName != VMA_NULL) - m_pName = VmaCreateStringCopy(hAllocator->GetAllocationCallbacks(), pName); -} - -uint8_t VmaAllocation_T::SwapBlockAllocation(VmaAllocator hAllocator, VmaAllocation allocation) -{ - VMA_ASSERT(allocation != VMA_NULL); - VMA_ASSERT(m_Type == ALLOCATION_TYPE_BLOCK); - VMA_ASSERT(allocation->m_Type == ALLOCATION_TYPE_BLOCK); - - if (m_MapCount != 0) - m_BlockAllocation.m_Block->Unmap(hAllocator, m_MapCount); - - m_BlockAllocation.m_Block->m_pMetadata->SetAllocationUserData(m_BlockAllocation.m_AllocHandle, allocation); - VMA_SWAP(m_BlockAllocation, allocation->m_BlockAllocation); - m_BlockAllocation.m_Block->m_pMetadata->SetAllocationUserData(m_BlockAllocation.m_AllocHandle, this); - -#if VMA_STATS_STRING_ENABLED - VMA_SWAP(m_BufferImageUsage, allocation->m_BufferImageUsage); -#endif - return m_MapCount; -} - -VmaAllocHandle VmaAllocation_T::GetAllocHandle() const -{ - switch (m_Type) - { - case ALLOCATION_TYPE_BLOCK: - return m_BlockAllocation.m_AllocHandle; - case ALLOCATION_TYPE_DEDICATED: - return VK_NULL_HANDLE; - default: - VMA_ASSERT(0); - return VK_NULL_HANDLE; - } -} - -VkDeviceSize VmaAllocation_T::GetOffset() const -{ - switch (m_Type) - { - case ALLOCATION_TYPE_BLOCK: - return m_BlockAllocation.m_Block->m_pMetadata->GetAllocationOffset(m_BlockAllocation.m_AllocHandle); - case ALLOCATION_TYPE_DEDICATED: - return 0; - default: - VMA_ASSERT(0); - return 0; - } -} - -VmaPool VmaAllocation_T::GetParentPool() const -{ - switch (m_Type) - { - case ALLOCATION_TYPE_BLOCK: - return m_BlockAllocation.m_Block->GetParentPool(); - case ALLOCATION_TYPE_DEDICATED: - return m_DedicatedAllocation.m_hParentPool; - default: - VMA_ASSERT(0); - return VK_NULL_HANDLE; - } -} - -VkDeviceMemory VmaAllocation_T::GetMemory() const -{ - switch (m_Type) - { - case ALLOCATION_TYPE_BLOCK: - return m_BlockAllocation.m_Block->GetDeviceMemory(); - case ALLOCATION_TYPE_DEDICATED: - return m_DedicatedAllocation.m_hMemory; - default: - VMA_ASSERT(0); - return VK_NULL_HANDLE; - } -} - -void* VmaAllocation_T::GetMappedData() const -{ - switch (m_Type) - { - case ALLOCATION_TYPE_BLOCK: - if (m_MapCount != 0 || 
IsPersistentMap()) - { - void* pBlockData = m_BlockAllocation.m_Block->GetMappedData(); - VMA_ASSERT(pBlockData != VMA_NULL); - return (char*)pBlockData + GetOffset(); - } - else - { - return VMA_NULL; - } - break; - case ALLOCATION_TYPE_DEDICATED: - VMA_ASSERT((m_DedicatedAllocation.m_pMappedData != VMA_NULL) == (m_MapCount != 0 || IsPersistentMap())); - return m_DedicatedAllocation.m_pMappedData; - default: - VMA_ASSERT(0); - return VMA_NULL; - } -} - -void VmaAllocation_T::BlockAllocMap() -{ - VMA_ASSERT(GetType() == ALLOCATION_TYPE_BLOCK); - VMA_ASSERT(IsMappingAllowed() && "Mapping is not allowed on this allocation! Please use one of the new VMA_ALLOCATION_CREATE_HOST_ACCESS_* flags when creating it."); - - if (m_MapCount < 0xFF) - { - ++m_MapCount; - } - else - { - VMA_ASSERT(0 && "Allocation mapped too many times simultaneously."); - } -} - -void VmaAllocation_T::BlockAllocUnmap() -{ - VMA_ASSERT(GetType() == ALLOCATION_TYPE_BLOCK); - - if (m_MapCount > 0) - { - --m_MapCount; - } - else - { - VMA_ASSERT(0 && "Unmapping allocation not previously mapped."); - } -} - -VkResult VmaAllocation_T::DedicatedAllocMap(VmaAllocator hAllocator, void** ppData) -{ - VMA_ASSERT(GetType() == ALLOCATION_TYPE_DEDICATED); - VMA_ASSERT(IsMappingAllowed() && "Mapping is not allowed on this allocation! Please use one of the new VMA_ALLOCATION_CREATE_HOST_ACCESS_* flags when creating it."); - - if (m_MapCount != 0 || IsPersistentMap()) - { - if (m_MapCount < 0xFF) - { - VMA_ASSERT(m_DedicatedAllocation.m_pMappedData != VMA_NULL); - *ppData = m_DedicatedAllocation.m_pMappedData; - ++m_MapCount; - return VK_SUCCESS; - } - else - { - VMA_ASSERT(0 && "Dedicated allocation mapped too many times simultaneously."); - return VK_ERROR_MEMORY_MAP_FAILED; - } - } - else - { - VkResult result = (*hAllocator->GetVulkanFunctions().vkMapMemory)( - hAllocator->m_hDevice, - m_DedicatedAllocation.m_hMemory, - 0, // offset - VK_WHOLE_SIZE, - 0, // flags - ppData); - if (result == VK_SUCCESS) - { - m_DedicatedAllocation.m_pMappedData = *ppData; - m_MapCount = 1; - } - return result; - } -} - -void VmaAllocation_T::DedicatedAllocUnmap(VmaAllocator hAllocator) -{ - VMA_ASSERT(GetType() == ALLOCATION_TYPE_DEDICATED); - - if (m_MapCount > 0) - { - --m_MapCount; - if (m_MapCount == 0 && !IsPersistentMap()) - { - m_DedicatedAllocation.m_pMappedData = VMA_NULL; - (*hAllocator->GetVulkanFunctions().vkUnmapMemory)( - hAllocator->m_hDevice, - m_DedicatedAllocation.m_hMemory); - } - } - else - { - VMA_ASSERT(0 && "Unmapping dedicated allocation not previously mapped."); - } -} - -#if VMA_STATS_STRING_ENABLED -void VmaAllocation_T::InitBufferImageUsage(uint32_t bufferImageUsage) -{ - VMA_ASSERT(m_BufferImageUsage == 0); - m_BufferImageUsage = bufferImageUsage; -} - -void VmaAllocation_T::PrintParameters(class VmaJsonWriter& json) const -{ - json.WriteString("Type"); - json.WriteString(VMA_SUBALLOCATION_TYPE_NAMES[m_SuballocationType]); - - json.WriteString("Size"); - json.WriteNumber(m_Size); - json.WriteString("Usage"); - json.WriteNumber(m_BufferImageUsage); - - if (m_pUserData != VMA_NULL) - { - json.WriteString("CustomData"); - json.BeginString(); - json.ContinueString_Pointer(m_pUserData); - json.EndString(); - } - if (m_pName != VMA_NULL) - { - json.WriteString("Name"); - json.WriteString(m_pName); - } -} -#endif // VMA_STATS_STRING_ENABLED - -void VmaAllocation_T::FreeName(VmaAllocator hAllocator) -{ - if(m_pName) - { - VmaFreeString(hAllocator->GetAllocationCallbacks(), m_pName); - m_pName = VMA_NULL; - } -} -#endif // 
_VMA_ALLOCATION_T_FUNCTIONS - -#ifndef _VMA_BLOCK_VECTOR_FUNCTIONS -VmaBlockVector::VmaBlockVector( - VmaAllocator hAllocator, - VmaPool hParentPool, - uint32_t memoryTypeIndex, - VkDeviceSize preferredBlockSize, - size_t minBlockCount, - size_t maxBlockCount, - VkDeviceSize bufferImageGranularity, - bool explicitBlockSize, - uint32_t algorithm, - float priority, - VkDeviceSize minAllocationAlignment, - void* pMemoryAllocateNext) - : m_hAllocator(hAllocator), - m_hParentPool(hParentPool), - m_MemoryTypeIndex(memoryTypeIndex), - m_PreferredBlockSize(preferredBlockSize), - m_MinBlockCount(minBlockCount), - m_MaxBlockCount(maxBlockCount), - m_BufferImageGranularity(bufferImageGranularity), - m_ExplicitBlockSize(explicitBlockSize), - m_Algorithm(algorithm), - m_Priority(priority), - m_MinAllocationAlignment(minAllocationAlignment), - m_pMemoryAllocateNext(pMemoryAllocateNext), - m_Blocks(VmaStlAllocator(hAllocator->GetAllocationCallbacks())), - m_NextBlockId(0) {} - -VmaBlockVector::~VmaBlockVector() -{ - for (size_t i = m_Blocks.size(); i--; ) - { - m_Blocks[i]->Destroy(m_hAllocator); - vma_delete(m_hAllocator, m_Blocks[i]); - } -} - -VkResult VmaBlockVector::CreateMinBlocks() -{ - for (size_t i = 0; i < m_MinBlockCount; ++i) - { - VkResult res = CreateBlock(m_PreferredBlockSize, VMA_NULL); - if (res != VK_SUCCESS) - { - return res; - } - } - return VK_SUCCESS; -} - -void VmaBlockVector::AddStatistics(VmaStatistics& inoutStats) -{ - VmaMutexLockRead lock(m_Mutex, m_hAllocator->m_UseMutex); - - const size_t blockCount = m_Blocks.size(); - for (uint32_t blockIndex = 0; blockIndex < blockCount; ++blockIndex) - { - const VmaDeviceMemoryBlock* const pBlock = m_Blocks[blockIndex]; - VMA_ASSERT(pBlock); - VMA_HEAVY_ASSERT(pBlock->Validate()); - pBlock->m_pMetadata->AddStatistics(inoutStats); - } -} - -void VmaBlockVector::AddDetailedStatistics(VmaDetailedStatistics& inoutStats) -{ - VmaMutexLockRead lock(m_Mutex, m_hAllocator->m_UseMutex); - - const size_t blockCount = m_Blocks.size(); - for (uint32_t blockIndex = 0; blockIndex < blockCount; ++blockIndex) - { - const VmaDeviceMemoryBlock* const pBlock = m_Blocks[blockIndex]; - VMA_ASSERT(pBlock); - VMA_HEAVY_ASSERT(pBlock->Validate()); - pBlock->m_pMetadata->AddDetailedStatistics(inoutStats); - } -} - -bool VmaBlockVector::IsEmpty() -{ - VmaMutexLockRead lock(m_Mutex, m_hAllocator->m_UseMutex); - return m_Blocks.empty(); -} - -bool VmaBlockVector::IsCorruptionDetectionEnabled() const -{ - const uint32_t requiredMemFlags = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT; - return (VMA_DEBUG_DETECT_CORRUPTION != 0) && - (VMA_DEBUG_MARGIN > 0) && - (m_Algorithm == 0 || m_Algorithm == VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT) && - (m_hAllocator->m_MemProps.memoryTypes[m_MemoryTypeIndex].propertyFlags & requiredMemFlags) == requiredMemFlags; -} - -VkResult VmaBlockVector::Allocate( - VkDeviceSize size, - VkDeviceSize alignment, - const VmaAllocationCreateInfo& createInfo, - VmaSuballocationType suballocType, - size_t allocationCount, - VmaAllocation* pAllocations) -{ - size_t allocIndex; - VkResult res = VK_SUCCESS; - - alignment = VMA_MAX(alignment, m_MinAllocationAlignment); - - if (IsCorruptionDetectionEnabled()) - { - size = VmaAlignUp(size, sizeof(VMA_CORRUPTION_DETECTION_MAGIC_VALUE)); - alignment = VmaAlignUp(alignment, sizeof(VMA_CORRUPTION_DETECTION_MAGIC_VALUE)); - } - - { - VmaMutexLockWrite lock(m_Mutex, m_hAllocator->m_UseMutex); - for (allocIndex = 0; allocIndex < allocationCount; ++allocIndex) - { - res = 
AllocatePage( - size, - alignment, - createInfo, - suballocType, - pAllocations + allocIndex); - if (res != VK_SUCCESS) - { - break; - } - } - } - - if (res != VK_SUCCESS) - { - // Free all already created allocations. - while (allocIndex--) - Free(pAllocations[allocIndex]); - memset(pAllocations, 0, sizeof(VmaAllocation) * allocationCount); - } - - return res; -} - -VkResult VmaBlockVector::AllocatePage( - VkDeviceSize size, - VkDeviceSize alignment, - const VmaAllocationCreateInfo& createInfo, - VmaSuballocationType suballocType, - VmaAllocation* pAllocation) -{ - const bool isUpperAddress = (createInfo.flags & VMA_ALLOCATION_CREATE_UPPER_ADDRESS_BIT) != 0; - - VkDeviceSize freeMemory; - { - const uint32_t heapIndex = m_hAllocator->MemoryTypeIndexToHeapIndex(m_MemoryTypeIndex); - VmaBudget heapBudget = {}; - m_hAllocator->GetHeapBudgets(&heapBudget, heapIndex, 1); - freeMemory = (heapBudget.usage < heapBudget.budget) ? (heapBudget.budget - heapBudget.usage) : 0; - } - - const bool canFallbackToDedicated = !HasExplicitBlockSize() && - (createInfo.flags & VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT) == 0; - const bool canCreateNewBlock = - ((createInfo.flags & VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT) == 0) && - (m_Blocks.size() < m_MaxBlockCount) && - (freeMemory >= size || !canFallbackToDedicated); - uint32_t strategy = createInfo.flags & VMA_ALLOCATION_CREATE_STRATEGY_MASK; - - // Upper address can only be used with linear allocator and within single memory block. - if (isUpperAddress && - (m_Algorithm != VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT || m_MaxBlockCount > 1)) - { - return VK_ERROR_FEATURE_NOT_PRESENT; - } - - // Early reject: requested allocation size is larger that maximum block size for this block vector. - if (size + VMA_DEBUG_MARGIN > m_PreferredBlockSize) - { - return VK_ERROR_OUT_OF_DEVICE_MEMORY; - } - - // 1. Search existing allocations. Try to allocate. - if (m_Algorithm == VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT) - { - // Use only last block. - if (!m_Blocks.empty()) - { - VmaDeviceMemoryBlock* const pCurrBlock = m_Blocks.back(); - VMA_ASSERT(pCurrBlock); - VkResult res = AllocateFromBlock( - pCurrBlock, size, alignment, createInfo.flags, createInfo.pUserData, suballocType, strategy, pAllocation); - if (res == VK_SUCCESS) - { - VMA_DEBUG_LOG(" Returned from last block #%u", pCurrBlock->GetId()); - IncrementallySortBlocks(); - return VK_SUCCESS; - } - } - } - else - { - if (strategy != VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT) // MIN_MEMORY or default - { - const bool isHostVisible = - (m_hAllocator->m_MemProps.memoryTypes[m_MemoryTypeIndex].propertyFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) != 0; - if(isHostVisible) - { - const bool isMappingAllowed = (createInfo.flags & - (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != 0; - /* - For non-mappable allocations, check blocks that are not mapped first. - For mappable allocations, check blocks that are already mapped first. - This way, having many blocks, we will separate mappable and non-mappable allocations, - hopefully limiting the number of blocks that are mapped, which will help tools like RenderDoc. - */ - for(size_t mappingI = 0; mappingI < 2; ++mappingI) - { - // Forward order in m_Blocks - prefer blocks with smallest amount of free space. 
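The comment above describes a two-pass preference: mappable requests look at already-mapped blocks first, everything else looks at unmapped blocks first. A minimal standalone sketch of the predicate used in the loop that follows (illustrative only, with made-up block states; it is not part of vk_mem_alloc.h):

```
// Illustrative sketch only, not part of vk_mem_alloc.h. It mirrors the predicate
// used in the loop below, (mappingI == 0) == (isMappingAllowed == isBlockMapped),
// with made-up block states.
#include <cstdio>
#include <vector>

int main()
{
    const bool isMappingAllowed = true;  // assume the request asked for host access
    const std::vector<bool> blockIsMapped = { false, true, false, true };

    for (int mappingI = 0; mappingI < 2; ++mappingI)
        for (size_t i = 0; i < blockIsMapped.size(); ++i)
            if ((mappingI == 0) == (isMappingAllowed == blockIsMapped[i]))
                std::printf("pass %d: try block %zu (mapped=%d)\n",
                            mappingI, i, static_cast<int>(blockIsMapped[i]));
    // Pass 0 tries blocks 1 and 3 (already mapped), pass 1 tries blocks 0 and 2,
    // so mappable and non-mappable allocations tend to end up in separate blocks.
    return 0;
}
```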
- for (size_t blockIndex = 0; blockIndex < m_Blocks.size(); ++blockIndex) - { - VmaDeviceMemoryBlock* const pCurrBlock = m_Blocks[blockIndex]; - VMA_ASSERT(pCurrBlock); - const bool isBlockMapped = pCurrBlock->GetMappedData() != VMA_NULL; - if((mappingI == 0) == (isMappingAllowed == isBlockMapped)) - { - VkResult res = AllocateFromBlock( - pCurrBlock, size, alignment, createInfo.flags, createInfo.pUserData, suballocType, strategy, pAllocation); - if (res == VK_SUCCESS) - { - VMA_DEBUG_LOG(" Returned from existing block #%u", pCurrBlock->GetId()); - IncrementallySortBlocks(); - return VK_SUCCESS; - } - } - } - } - } - else - { - // Forward order in m_Blocks - prefer blocks with smallest amount of free space. - for (size_t blockIndex = 0; blockIndex < m_Blocks.size(); ++blockIndex) - { - VmaDeviceMemoryBlock* const pCurrBlock = m_Blocks[blockIndex]; - VMA_ASSERT(pCurrBlock); - VkResult res = AllocateFromBlock( - pCurrBlock, size, alignment, createInfo.flags, createInfo.pUserData, suballocType, strategy, pAllocation); - if (res == VK_SUCCESS) - { - VMA_DEBUG_LOG(" Returned from existing block #%u", pCurrBlock->GetId()); - IncrementallySortBlocks(); - return VK_SUCCESS; - } - } - } - } - else // VMA_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT - { - // Backward order in m_Blocks - prefer blocks with largest amount of free space. - for (size_t blockIndex = m_Blocks.size(); blockIndex--; ) - { - VmaDeviceMemoryBlock* const pCurrBlock = m_Blocks[blockIndex]; - VMA_ASSERT(pCurrBlock); - VkResult res = AllocateFromBlock(pCurrBlock, size, alignment, createInfo.flags, createInfo.pUserData, suballocType, strategy, pAllocation); - if (res == VK_SUCCESS) - { - VMA_DEBUG_LOG(" Returned from existing block #%u", pCurrBlock->GetId()); - IncrementallySortBlocks(); - return VK_SUCCESS; - } - } - } - } - - // 2. Try to create new block. - if (canCreateNewBlock) - { - // Calculate optimal size for new block. - VkDeviceSize newBlockSize = m_PreferredBlockSize; - uint32_t newBlockSizeShift = 0; - const uint32_t NEW_BLOCK_SIZE_SHIFT_MAX = 3; - - if (!m_ExplicitBlockSize) - { - // Allocate 1/8, 1/4, 1/2 as first blocks. - const VkDeviceSize maxExistingBlockSize = CalcMaxBlockSize(); - for (uint32_t i = 0; i < NEW_BLOCK_SIZE_SHIFT_MAX; ++i) - { - const VkDeviceSize smallerNewBlockSize = newBlockSize / 2; - if (smallerNewBlockSize > maxExistingBlockSize && smallerNewBlockSize >= size * 2) - { - newBlockSize = smallerNewBlockSize; - ++newBlockSizeShift; - } - else - { - break; - } - } - } - - size_t newBlockIndex = 0; - VkResult res = (newBlockSize <= freeMemory || !canFallbackToDedicated) ? - CreateBlock(newBlockSize, &newBlockIndex) : VK_ERROR_OUT_OF_DEVICE_MEMORY; - // Allocation of this size failed? Try 1/2, 1/4, 1/8 of m_PreferredBlockSize. - if (!m_ExplicitBlockSize) - { - while (res < 0 && newBlockSizeShift < NEW_BLOCK_SIZE_SHIFT_MAX) - { - const VkDeviceSize smallerNewBlockSize = newBlockSize / 2; - if (smallerNewBlockSize >= size) - { - newBlockSize = smallerNewBlockSize; - ++newBlockSizeShift; - res = (newBlockSize <= freeMemory || !canFallbackToDedicated) ? 
- CreateBlock(newBlockSize, &newBlockIndex) : VK_ERROR_OUT_OF_DEVICE_MEMORY; - } - else - { - break; - } - } - } - - if (res == VK_SUCCESS) - { - VmaDeviceMemoryBlock* const pBlock = m_Blocks[newBlockIndex]; - VMA_ASSERT(pBlock->m_pMetadata->GetSize() >= size); - - res = AllocateFromBlock( - pBlock, size, alignment, createInfo.flags, createInfo.pUserData, suballocType, strategy, pAllocation); - if (res == VK_SUCCESS) - { - VMA_DEBUG_LOG(" Created new block #%u Size=%llu", pBlock->GetId(), newBlockSize); - IncrementallySortBlocks(); - return VK_SUCCESS; - } - else - { - // Allocation from new block failed, possibly due to VMA_DEBUG_MARGIN or alignment. - return VK_ERROR_OUT_OF_DEVICE_MEMORY; - } - } - } - - return VK_ERROR_OUT_OF_DEVICE_MEMORY; -} - -void VmaBlockVector::Free(const VmaAllocation hAllocation) -{ - VmaDeviceMemoryBlock* pBlockToDelete = VMA_NULL; - - bool budgetExceeded = false; - { - const uint32_t heapIndex = m_hAllocator->MemoryTypeIndexToHeapIndex(m_MemoryTypeIndex); - VmaBudget heapBudget = {}; - m_hAllocator->GetHeapBudgets(&heapBudget, heapIndex, 1); - budgetExceeded = heapBudget.usage >= heapBudget.budget; - } - - // Scope for lock. - { - VmaMutexLockWrite lock(m_Mutex, m_hAllocator->m_UseMutex); - - VmaDeviceMemoryBlock* pBlock = hAllocation->GetBlock(); - - if (IsCorruptionDetectionEnabled()) - { - VkResult res = pBlock->ValidateMagicValueAfterAllocation(m_hAllocator, hAllocation->GetOffset(), hAllocation->GetSize()); - VMA_ASSERT(res == VK_SUCCESS && "Couldn't map block memory to validate magic value."); - } - - if (hAllocation->IsPersistentMap()) - { - pBlock->Unmap(m_hAllocator, 1); - } - - const bool hadEmptyBlockBeforeFree = HasEmptyBlock(); - pBlock->m_pMetadata->Free(hAllocation->GetAllocHandle()); - pBlock->PostFree(m_hAllocator); - VMA_HEAVY_ASSERT(pBlock->Validate()); - - VMA_DEBUG_LOG(" Freed from MemoryTypeIndex=%u", m_MemoryTypeIndex); - - const bool canDeleteBlock = m_Blocks.size() > m_MinBlockCount; - // pBlock became empty after this deallocation. - if (pBlock->m_pMetadata->IsEmpty()) - { - // Already had empty block. We don't want to have two, so delete this one. - if ((hadEmptyBlockBeforeFree || budgetExceeded) && canDeleteBlock) - { - pBlockToDelete = pBlock; - Remove(pBlock); - } - // else: We now have one empty block - leave it. A hysteresis to avoid allocating whole block back and forth. - } - // pBlock didn't become empty, but we have another empty block - find and free that one. - // (This is optional, heuristics.) - else if (hadEmptyBlockBeforeFree && canDeleteBlock) - { - VmaDeviceMemoryBlock* pLastBlock = m_Blocks.back(); - if (pLastBlock->m_pMetadata->IsEmpty()) - { - pBlockToDelete = pLastBlock; - m_Blocks.pop_back(); - } - } - - IncrementallySortBlocks(); - } - - // Destruction of a free block. Deferred until this point, outside of mutex - // lock, for performance reason. 
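The block-size fallback completed above can be illustrated with a small standalone sketch, using made-up sizes rather than library code: when the preferred-size block cannot be allocated, the allocator retries with 1/2, 1/4 and 1/8 of the preferred size, as long as the halved size still fits the request.

```
// Sketch of the block-size fallback above, with assumed example sizes.
#include <cstdint>
#include <cstdio>

int main()
{
    const uint64_t preferredBlockSize = 256ull << 20;  // assumed 256 MiB preferred block size
    const uint64_t requestSize        = 20ull << 20;   // assumed 20 MiB allocation request
    const uint32_t NEW_BLOCK_SIZE_SHIFT_MAX = 3;       // same limit as in the code above

    uint64_t newBlockSize = preferredBlockSize;
    for (uint32_t shift = 0; shift < NEW_BLOCK_SIZE_SHIFT_MAX; ++shift)
    {
        const uint64_t smaller = newBlockSize / 2;
        if (smaller < requestSize)
            break;                      // halving again would no longer fit the request
        newBlockSize = smaller;
        std::printf("retry with %llu MiB block\n",
                    (unsigned long long)(newBlockSize >> 20));
    }
    // Prints 128, 64 and 32 MiB: the candidate sizes tried after the full-size block fails.
    return 0;
}
```

Capping the shrink at 1/8 of the preferred size keeps the number of VkDeviceMemory objects small while still giving the allocator a way out when device memory is nearly exhausted.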
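Free() above also validates a magic value written past each allocation when corruption detection is compiled in. A hedged sketch of how that path is usually enabled at build time; the margin value is arbitrary, and vmaCheckCorruption is assumed to be the public wrapper that ends up in VmaBlockVector::CheckCorruption (it does not appear in this hunk):

```
// Sketch, with assumed macro values: enabling the corruption-detection path used above.
// Define these in exactly one translation unit, before the VMA implementation.
#define VMA_DEBUG_MARGIN 16            // reserve a margin after every allocation
#define VMA_DEBUG_DETECT_CORRUPTION 1  // fill the margin with a magic value and verify it
#define VMA_IMPLEMENTATION
#include "vk_mem_alloc.h"

#include <cstdint>

VkResult CheckHostVisibleMemory(VmaAllocator allocator)
{
    // Only HOST_VISIBLE | HOST_COHERENT memory types are checked, matching
    // VmaBlockVector::IsCorruptionDetectionEnabled() above.
    return vmaCheckCorruption(allocator, UINT32_MAX);
}
```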
- if (pBlockToDelete != VMA_NULL) - { - VMA_DEBUG_LOG(" Deleted empty block #%u", pBlockToDelete->GetId()); - pBlockToDelete->Destroy(m_hAllocator); - vma_delete(m_hAllocator, pBlockToDelete); - } - - m_hAllocator->m_Budget.RemoveAllocation(m_hAllocator->MemoryTypeIndexToHeapIndex(m_MemoryTypeIndex), hAllocation->GetSize()); - m_hAllocator->m_AllocationObjectAllocator.Free(hAllocation); -} - -VkDeviceSize VmaBlockVector::CalcMaxBlockSize() const -{ - VkDeviceSize result = 0; - for (size_t i = m_Blocks.size(); i--; ) - { - result = VMA_MAX(result, m_Blocks[i]->m_pMetadata->GetSize()); - if (result >= m_PreferredBlockSize) - { - break; - } - } - return result; -} - -void VmaBlockVector::Remove(VmaDeviceMemoryBlock* pBlock) -{ - for (uint32_t blockIndex = 0; blockIndex < m_Blocks.size(); ++blockIndex) - { - if (m_Blocks[blockIndex] == pBlock) - { - VmaVectorRemove(m_Blocks, blockIndex); - return; - } - } - VMA_ASSERT(0); -} - -void VmaBlockVector::IncrementallySortBlocks() -{ - if (!m_IncrementalSort) - return; - if (m_Algorithm != VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT) - { - // Bubble sort only until first swap. - for (size_t i = 1; i < m_Blocks.size(); ++i) - { - if (m_Blocks[i - 1]->m_pMetadata->GetSumFreeSize() > m_Blocks[i]->m_pMetadata->GetSumFreeSize()) - { - VMA_SWAP(m_Blocks[i - 1], m_Blocks[i]); - return; - } - } - } -} - -void VmaBlockVector::SortByFreeSize() -{ - VMA_SORT(m_Blocks.begin(), m_Blocks.end(), - [](VmaDeviceMemoryBlock* b1, VmaDeviceMemoryBlock* b2) -> bool - { - return b1->m_pMetadata->GetSumFreeSize() < b2->m_pMetadata->GetSumFreeSize(); - }); -} - -VkResult VmaBlockVector::AllocateFromBlock( - VmaDeviceMemoryBlock* pBlock, - VkDeviceSize size, - VkDeviceSize alignment, - VmaAllocationCreateFlags allocFlags, - void* pUserData, - VmaSuballocationType suballocType, - uint32_t strategy, - VmaAllocation* pAllocation) -{ - const bool isUpperAddress = (allocFlags & VMA_ALLOCATION_CREATE_UPPER_ADDRESS_BIT) != 0; - - VmaAllocationRequest currRequest = {}; - if (pBlock->m_pMetadata->CreateAllocationRequest( - size, - alignment, - isUpperAddress, - suballocType, - strategy, - &currRequest)) - { - return CommitAllocationRequest(currRequest, pBlock, alignment, allocFlags, pUserData, suballocType, pAllocation); - } - return VK_ERROR_OUT_OF_DEVICE_MEMORY; -} - -VkResult VmaBlockVector::CommitAllocationRequest( - VmaAllocationRequest& allocRequest, - VmaDeviceMemoryBlock* pBlock, - VkDeviceSize alignment, - VmaAllocationCreateFlags allocFlags, - void* pUserData, - VmaSuballocationType suballocType, - VmaAllocation* pAllocation) -{ - const bool mapped = (allocFlags & VMA_ALLOCATION_CREATE_MAPPED_BIT) != 0; - const bool isUserDataString = (allocFlags & VMA_ALLOCATION_CREATE_USER_DATA_COPY_STRING_BIT) != 0; - const bool isMappingAllowed = (allocFlags & - (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != 0; - - pBlock->PostAlloc(); - // Allocate from pCurrBlock. - if (mapped) - { - VkResult res = pBlock->Map(m_hAllocator, 1, VMA_NULL); - if (res != VK_SUCCESS) - { - return res; - } - } - - *pAllocation = m_hAllocator->m_AllocationObjectAllocator.Allocate(isMappingAllowed); - pBlock->m_pMetadata->Alloc(allocRequest, suballocType, *pAllocation); - (*pAllocation)->InitBlockAllocation( - pBlock, - allocRequest.allocHandle, - alignment, - allocRequest.size, // Not size, as actual allocation size may be larger than requested! 
- m_MemoryTypeIndex, - suballocType, - mapped); - VMA_HEAVY_ASSERT(pBlock->Validate()); - if (isUserDataString) - (*pAllocation)->SetName(m_hAllocator, (const char*)pUserData); - else - (*pAllocation)->SetUserData(m_hAllocator, pUserData); - m_hAllocator->m_Budget.AddAllocation(m_hAllocator->MemoryTypeIndexToHeapIndex(m_MemoryTypeIndex), allocRequest.size); - if (VMA_DEBUG_INITIALIZE_ALLOCATIONS) - { - m_hAllocator->FillAllocation(*pAllocation, VMA_ALLOCATION_FILL_PATTERN_CREATED); - } - if (IsCorruptionDetectionEnabled()) - { - VkResult res = pBlock->WriteMagicValueAfterAllocation(m_hAllocator, (*pAllocation)->GetOffset(), allocRequest.size); - VMA_ASSERT(res == VK_SUCCESS && "Couldn't map block memory to write magic value."); - } - return VK_SUCCESS; -} - -VkResult VmaBlockVector::CreateBlock(VkDeviceSize blockSize, size_t* pNewBlockIndex) -{ - VkMemoryAllocateInfo allocInfo = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO }; - allocInfo.pNext = m_pMemoryAllocateNext; - allocInfo.memoryTypeIndex = m_MemoryTypeIndex; - allocInfo.allocationSize = blockSize; - -#if VMA_BUFFER_DEVICE_ADDRESS - // Every standalone block can potentially contain a buffer with VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT - always enable the feature. - VkMemoryAllocateFlagsInfoKHR allocFlagsInfo = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO_KHR }; - if (m_hAllocator->m_UseKhrBufferDeviceAddress) - { - allocFlagsInfo.flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT_KHR; - VmaPnextChainPushFront(&allocInfo, &allocFlagsInfo); - } -#endif // VMA_BUFFER_DEVICE_ADDRESS - -#if VMA_MEMORY_PRIORITY - VkMemoryPriorityAllocateInfoEXT priorityInfo = { VK_STRUCTURE_TYPE_MEMORY_PRIORITY_ALLOCATE_INFO_EXT }; - if (m_hAllocator->m_UseExtMemoryPriority) - { - VMA_ASSERT(m_Priority >= 0.f && m_Priority <= 1.f); - priorityInfo.priority = m_Priority; - VmaPnextChainPushFront(&allocInfo, &priorityInfo); - } -#endif // VMA_MEMORY_PRIORITY - -#if VMA_EXTERNAL_MEMORY - // Attach VkExportMemoryAllocateInfoKHR if necessary. - VkExportMemoryAllocateInfoKHR exportMemoryAllocInfo = { VK_STRUCTURE_TYPE_EXPORT_MEMORY_ALLOCATE_INFO_KHR }; - exportMemoryAllocInfo.handleTypes = m_hAllocator->GetExternalMemoryHandleTypeFlags(m_MemoryTypeIndex); - if (exportMemoryAllocInfo.handleTypes != 0) - { - VmaPnextChainPushFront(&allocInfo, &exportMemoryAllocInfo); - } -#endif // VMA_EXTERNAL_MEMORY - - VkDeviceMemory mem = VK_NULL_HANDLE; - VkResult res = m_hAllocator->AllocateVulkanMemory(&allocInfo, &mem); - if (res < 0) - { - return res; - } - - // New VkDeviceMemory successfully created. - - // Create new Allocation for it. 
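CreateBlock() above builds the VkMemoryAllocateInfo::pNext chain conditionally. The same chain can be sketched in plain Vulkan without the VMA helpers; this is a simplified sketch, since the real code only attaches each struct when the matching extension was enabled on the allocator:

```
// Sketch of the pNext chain assembled in CreateBlock() above, in plain Vulkan.
// Simplified: the real code attaches each struct only when the corresponding
// extension or feature is enabled.
#include <vulkan/vulkan.h>

VkResult AllocateBlockMemory(VkDevice device, uint32_t memoryTypeIndex,
                             VkDeviceSize blockSize, VkDeviceMemory* pMemory)
{
    VkMemoryAllocateFlagsInfoKHR flagsInfo = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO_KHR };
    flagsInfo.flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT_KHR;  // buffer device address support

    VkMemoryPriorityAllocateInfoEXT priorityInfo = { VK_STRUCTURE_TYPE_MEMORY_PRIORITY_ALLOCATE_INFO_EXT };
    priorityInfo.priority = 0.5f;                                 // default priority used above

    VkMemoryAllocateInfo allocInfo = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO };
    allocInfo.memoryTypeIndex = memoryTypeIndex;
    allocInfo.allocationSize  = blockSize;

    // Push each extension struct to the front of the chain, like VmaPnextChainPushFront().
    priorityInfo.pNext = allocInfo.pNext;
    allocInfo.pNext    = &priorityInfo;
    flagsInfo.pNext    = allocInfo.pNext;
    allocInfo.pNext    = &flagsInfo;

    return vkAllocateMemory(device, &allocInfo, nullptr, pMemory);
}
```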
- VmaDeviceMemoryBlock* const pBlock = vma_new(m_hAllocator, VmaDeviceMemoryBlock)(m_hAllocator); - pBlock->Init( - m_hAllocator, - m_hParentPool, - m_MemoryTypeIndex, - mem, - allocInfo.allocationSize, - m_NextBlockId++, - m_Algorithm, - m_BufferImageGranularity); - - m_Blocks.push_back(pBlock); - if (pNewBlockIndex != VMA_NULL) - { - *pNewBlockIndex = m_Blocks.size() - 1; - } - - return VK_SUCCESS; -} - -bool VmaBlockVector::HasEmptyBlock() -{ - for (size_t index = 0, count = m_Blocks.size(); index < count; ++index) - { - VmaDeviceMemoryBlock* const pBlock = m_Blocks[index]; - if (pBlock->m_pMetadata->IsEmpty()) - { - return true; - } - } - return false; -} - -#if VMA_STATS_STRING_ENABLED -void VmaBlockVector::PrintDetailedMap(class VmaJsonWriter& json) -{ - VmaMutexLockRead lock(m_Mutex, m_hAllocator->m_UseMutex); - - - json.BeginObject(); - for (size_t i = 0; i < m_Blocks.size(); ++i) - { - json.BeginString(); - json.ContinueString(m_Blocks[i]->GetId()); - json.EndString(); - - json.BeginObject(); - json.WriteString("MapRefCount"); - json.WriteNumber(m_Blocks[i]->GetMapRefCount()); - - m_Blocks[i]->m_pMetadata->PrintDetailedMap(json); - json.EndObject(); - } - json.EndObject(); -} -#endif // VMA_STATS_STRING_ENABLED - -VkResult VmaBlockVector::CheckCorruption() -{ - if (!IsCorruptionDetectionEnabled()) - { - return VK_ERROR_FEATURE_NOT_PRESENT; - } - - VmaMutexLockRead lock(m_Mutex, m_hAllocator->m_UseMutex); - for (uint32_t blockIndex = 0; blockIndex < m_Blocks.size(); ++blockIndex) - { - VmaDeviceMemoryBlock* const pBlock = m_Blocks[blockIndex]; - VMA_ASSERT(pBlock); - VkResult res = pBlock->CheckCorruption(m_hAllocator); - if (res != VK_SUCCESS) - { - return res; - } - } - return VK_SUCCESS; -} - -#endif // _VMA_BLOCK_VECTOR_FUNCTIONS - -#ifndef _VMA_DEFRAGMENTATION_CONTEXT_FUNCTIONS -VmaDefragmentationContext_T::VmaDefragmentationContext_T( - VmaAllocator hAllocator, - const VmaDefragmentationInfo& info) - : m_MaxPassBytes(info.maxBytesPerPass == 0 ? VK_WHOLE_SIZE : info.maxBytesPerPass), - m_MaxPassAllocations(info.maxAllocationsPerPass == 0 ? 
UINT32_MAX : info.maxAllocationsPerPass), - m_MoveAllocator(hAllocator->GetAllocationCallbacks()), - m_Moves(m_MoveAllocator) -{ - m_Algorithm = info.flags & VMA_DEFRAGMENTATION_FLAG_ALGORITHM_MASK; - - if (info.pool != VMA_NULL) - { - m_BlockVectorCount = 1; - m_PoolBlockVector = &info.pool->m_BlockVector; - m_pBlockVectors = &m_PoolBlockVector; - m_PoolBlockVector->SetIncrementalSort(false); - m_PoolBlockVector->SortByFreeSize(); - } - else - { - m_BlockVectorCount = hAllocator->GetMemoryTypeCount(); - m_PoolBlockVector = VMA_NULL; - m_pBlockVectors = hAllocator->m_pBlockVectors; - for (uint32_t i = 0; i < m_BlockVectorCount; ++i) - { - VmaBlockVector* vector = m_pBlockVectors[i]; - if (vector != VMA_NULL) - { - vector->SetIncrementalSort(false); - vector->SortByFreeSize(); - } - } - } - - switch (m_Algorithm) - { - case 0: // Default algorithm - m_Algorithm = VMA_DEFRAGMENTATION_FLAG_ALGORITHM_BALANCED_BIT; - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_BALANCED_BIT: - { - m_AlgorithmState = vma_new_array(hAllocator, StateBalanced, m_BlockVectorCount); - break; - } - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_EXTENSIVE_BIT: - { - if (hAllocator->GetBufferImageGranularity() > 1) - { - m_AlgorithmState = vma_new_array(hAllocator, StateExtensive, m_BlockVectorCount); - } - break; - } - } -} - -VmaDefragmentationContext_T::~VmaDefragmentationContext_T() -{ - if (m_PoolBlockVector != VMA_NULL) - { - m_PoolBlockVector->SetIncrementalSort(true); - } - else - { - for (uint32_t i = 0; i < m_BlockVectorCount; ++i) - { - VmaBlockVector* vector = m_pBlockVectors[i]; - if (vector != VMA_NULL) - vector->SetIncrementalSort(true); - } - } - - if (m_AlgorithmState) - { - switch (m_Algorithm) - { - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_BALANCED_BIT: - vma_delete_array(m_MoveAllocator.m_pCallbacks, reinterpret_cast(m_AlgorithmState), m_BlockVectorCount); - break; - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_EXTENSIVE_BIT: - vma_delete_array(m_MoveAllocator.m_pCallbacks, reinterpret_cast(m_AlgorithmState), m_BlockVectorCount); - break; - default: - VMA_ASSERT(0); - } - } -} - -VkResult VmaDefragmentationContext_T::DefragmentPassBegin(VmaDefragmentationPassMoveInfo& moveInfo) -{ - if (m_PoolBlockVector != VMA_NULL) - { - VmaMutexLockWrite lock(m_PoolBlockVector->GetMutex(), m_PoolBlockVector->GetAllocator()->m_UseMutex); - - if (m_PoolBlockVector->GetBlockCount() > 1) - ComputeDefragmentation(*m_PoolBlockVector, 0); - else if (m_PoolBlockVector->GetBlockCount() == 1) - ReallocWithinBlock(*m_PoolBlockVector, m_PoolBlockVector->GetBlock(0)); - } - else - { - for (uint32_t i = 0; i < m_BlockVectorCount; ++i) - { - if (m_pBlockVectors[i] != VMA_NULL) - { - VmaMutexLockWrite lock(m_pBlockVectors[i]->GetMutex(), m_pBlockVectors[i]->GetAllocator()->m_UseMutex); - - if (m_pBlockVectors[i]->GetBlockCount() > 1) - { - if (ComputeDefragmentation(*m_pBlockVectors[i], i)) - break; - } - else if (m_pBlockVectors[i]->GetBlockCount() == 1) - { - if (ReallocWithinBlock(*m_pBlockVectors[i], m_pBlockVectors[i]->GetBlock(0))) - break; - } - } - } - } - - moveInfo.moveCount = static_cast(m_Moves.size()); - if (moveInfo.moveCount > 0) - { - moveInfo.pMoves = m_Moves.data(); - return VK_INCOMPLETE; - } - - moveInfo.pMoves = VMA_NULL; - return VK_SUCCESS; -} - -VkResult VmaDefragmentationContext_T::DefragmentPassEnd(VmaDefragmentationPassMoveInfo& moveInfo) -{ - VMA_ASSERT(moveInfo.moveCount > 0 ? 
moveInfo.pMoves != VMA_NULL : true); - - VkResult result = VK_SUCCESS; - VmaStlAllocator blockAllocator(m_MoveAllocator.m_pCallbacks); - VmaVector> immovableBlocks(blockAllocator); - VmaVector> mappedBlocks(blockAllocator); - - VmaAllocator allocator = VMA_NULL; - for (uint32_t i = 0; i < moveInfo.moveCount; ++i) - { - VmaDefragmentationMove& move = moveInfo.pMoves[i]; - size_t prevCount = 0, currentCount = 0; - VkDeviceSize freedBlockSize = 0; - - uint32_t vectorIndex; - VmaBlockVector* vector; - if (m_PoolBlockVector != VMA_NULL) - { - vectorIndex = 0; - vector = m_PoolBlockVector; - } - else - { - vectorIndex = move.srcAllocation->GetMemoryTypeIndex(); - vector = m_pBlockVectors[vectorIndex]; - VMA_ASSERT(vector != VMA_NULL); - } - - switch (move.operation) - { - case VMA_DEFRAGMENTATION_MOVE_OPERATION_COPY: - { - uint8_t mapCount = move.srcAllocation->SwapBlockAllocation(vector->m_hAllocator, move.dstTmpAllocation); - if (mapCount > 0) - { - allocator = vector->m_hAllocator; - VmaDeviceMemoryBlock* newMapBlock = move.srcAllocation->GetBlock(); - bool notPresent = true; - for (FragmentedBlock& block : mappedBlocks) - { - if (block.block == newMapBlock) - { - notPresent = false; - block.data += mapCount; - break; - } - } - if (notPresent) - mappedBlocks.push_back({ mapCount, newMapBlock }); - } - - // Scope for locks, Free have it's own lock - { - VmaMutexLockRead lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - prevCount = vector->GetBlockCount(); - freedBlockSize = move.dstTmpAllocation->GetBlock()->m_pMetadata->GetSize(); - } - vector->Free(move.dstTmpAllocation); - { - VmaMutexLockRead lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - currentCount = vector->GetBlockCount(); - } - - result = VK_INCOMPLETE; - break; - } - case VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE: - { - m_PassStats.bytesMoved -= move.srcAllocation->GetSize(); - --m_PassStats.allocationsMoved; - vector->Free(move.dstTmpAllocation); - - VmaDeviceMemoryBlock* newBlock = move.srcAllocation->GetBlock(); - bool notPresent = true; - for (const FragmentedBlock& block : immovableBlocks) - { - if (block.block == newBlock) - { - notPresent = false; - break; - } - } - if (notPresent) - immovableBlocks.push_back({ vectorIndex, newBlock }); - break; - } - case VMA_DEFRAGMENTATION_MOVE_OPERATION_DESTROY: - { - m_PassStats.bytesMoved -= move.srcAllocation->GetSize(); - --m_PassStats.allocationsMoved; - // Scope for locks, Free have it's own lock - { - VmaMutexLockRead lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - prevCount = vector->GetBlockCount(); - freedBlockSize = move.srcAllocation->GetBlock()->m_pMetadata->GetSize(); - } - vector->Free(move.srcAllocation); - { - VmaMutexLockRead lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - currentCount = vector->GetBlockCount(); - } - freedBlockSize *= prevCount - currentCount; - - VkDeviceSize dstBlockSize; - { - VmaMutexLockRead lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - dstBlockSize = move.dstTmpAllocation->GetBlock()->m_pMetadata->GetSize(); - } - vector->Free(move.dstTmpAllocation); - { - VmaMutexLockRead lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - freedBlockSize += dstBlockSize * (currentCount - vector->GetBlockCount()); - currentCount = vector->GetBlockCount(); - } - - result = VK_INCOMPLETE; - break; - } - default: - VMA_ASSERT(0); - } - - if (prevCount > currentCount) - { - size_t freedBlocks = prevCount - currentCount; - m_PassStats.deviceMemoryBlocksFreed += 
static_cast(freedBlocks); - m_PassStats.bytesFreed += freedBlockSize; - } - - switch (m_Algorithm) - { - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_EXTENSIVE_BIT: - { - if (m_AlgorithmState != VMA_NULL) - { - // Avoid unnecessary tries to allocate when new free block is avaiable - StateExtensive& state = reinterpret_cast(m_AlgorithmState)[vectorIndex]; - if (state.firstFreeBlock != SIZE_MAX) - { - const size_t diff = prevCount - currentCount; - if (state.firstFreeBlock >= diff) - { - state.firstFreeBlock -= diff; - if (state.firstFreeBlock != 0) - state.firstFreeBlock -= vector->GetBlock(state.firstFreeBlock - 1)->m_pMetadata->IsEmpty(); - } - else - state.firstFreeBlock = 0; - } - } - } - } - } - moveInfo.moveCount = 0; - moveInfo.pMoves = VMA_NULL; - m_Moves.clear(); - - // Update stats - m_GlobalStats.allocationsMoved += m_PassStats.allocationsMoved; - m_GlobalStats.bytesFreed += m_PassStats.bytesFreed; - m_GlobalStats.bytesMoved += m_PassStats.bytesMoved; - m_GlobalStats.deviceMemoryBlocksFreed += m_PassStats.deviceMemoryBlocksFreed; - m_PassStats = { 0 }; - - // Move blocks with immovable allocations according to algorithm - if (immovableBlocks.size() > 0) - { - switch (m_Algorithm) - { - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_EXTENSIVE_BIT: - { - if (m_AlgorithmState != VMA_NULL) - { - bool swapped = false; - // Move to the start of free blocks range - for (const FragmentedBlock& block : immovableBlocks) - { - StateExtensive& state = reinterpret_cast(m_AlgorithmState)[block.data]; - if (state.operation != StateExtensive::Operation::Cleanup) - { - VmaBlockVector* vector = m_pBlockVectors[block.data]; - VmaMutexLockWrite lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - - for (size_t i = 0, count = vector->GetBlockCount() - m_ImmovableBlockCount; i < count; ++i) - { - if (vector->GetBlock(i) == block.block) - { - VMA_SWAP(vector->m_Blocks[i], vector->m_Blocks[vector->GetBlockCount() - ++m_ImmovableBlockCount]); - if (state.firstFreeBlock != SIZE_MAX) - { - if (i + 1 < state.firstFreeBlock) - { - if (state.firstFreeBlock > 1) - VMA_SWAP(vector->m_Blocks[i], vector->m_Blocks[--state.firstFreeBlock]); - else - --state.firstFreeBlock; - } - } - swapped = true; - break; - } - } - } - } - if (swapped) - result = VK_INCOMPLETE; - break; - } - } - default: - { - // Move to the begining - for (const FragmentedBlock& block : immovableBlocks) - { - VmaBlockVector* vector = m_pBlockVectors[block.data]; - VmaMutexLockWrite lock(vector->GetMutex(), vector->GetAllocator()->m_UseMutex); - - for (size_t i = m_ImmovableBlockCount; i < vector->GetBlockCount(); ++i) - { - if (vector->GetBlock(i) == block.block) - { - VMA_SWAP(vector->m_Blocks[i], vector->m_Blocks[m_ImmovableBlockCount++]); - break; - } - } - } - break; - } - } - } - - // Bulk-map destination blocks - for (const FragmentedBlock& block : mappedBlocks) - { - VkResult res = block.block->Map(allocator, block.data, VMA_NULL); - VMA_ASSERT(res == VK_SUCCESS); - } - return result; -} - -bool VmaDefragmentationContext_T::ComputeDefragmentation(VmaBlockVector& vector, size_t index) -{ - switch (m_Algorithm) - { - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FAST_BIT: - return ComputeDefragmentation_Fast(vector); - default: - VMA_ASSERT(0); - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_BALANCED_BIT: - return ComputeDefragmentation_Balanced(vector, index, true); - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FULL_BIT: - return ComputeDefragmentation_Full(vector); - case VMA_DEFRAGMENTATION_FLAG_ALGORITHM_EXTENSIVE_BIT: - return 
ComputeDefragmentation_Extensive(vector, index); - } -} - -VmaDefragmentationContext_T::MoveAllocationData VmaDefragmentationContext_T::GetMoveData( - VmaAllocHandle handle, VmaBlockMetadata* metadata) -{ - MoveAllocationData moveData; - moveData.move.srcAllocation = (VmaAllocation)metadata->GetAllocationUserData(handle); - moveData.size = moveData.move.srcAllocation->GetSize(); - moveData.alignment = moveData.move.srcAllocation->GetAlignment(); - moveData.type = moveData.move.srcAllocation->GetSuballocationType(); - moveData.flags = 0; - - if (moveData.move.srcAllocation->IsPersistentMap()) - moveData.flags |= VMA_ALLOCATION_CREATE_MAPPED_BIT; - if (moveData.move.srcAllocation->IsMappingAllowed()) - moveData.flags |= VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT; - - return moveData; -} - -VmaDefragmentationContext_T::CounterStatus VmaDefragmentationContext_T::CheckCounters(VkDeviceSize bytes) -{ - // Ignore allocation if will exceed max size for copy - if (m_PassStats.bytesMoved + bytes > m_MaxPassBytes) - { - if (++m_IgnoredAllocs < MAX_ALLOCS_TO_IGNORE) - return CounterStatus::Ignore; - else - return CounterStatus::End; - } - return CounterStatus::Pass; -} - -bool VmaDefragmentationContext_T::IncrementCounters(VkDeviceSize bytes) -{ - m_PassStats.bytesMoved += bytes; - // Early return when max found - if (++m_PassStats.allocationsMoved >= m_MaxPassAllocations || m_PassStats.bytesMoved >= m_MaxPassBytes) - { - VMA_ASSERT(m_PassStats.allocationsMoved == m_MaxPassAllocations || - m_PassStats.bytesMoved == m_MaxPassBytes && "Exceeded maximal pass threshold!"); - return true; - } - return false; -} - -bool VmaDefragmentationContext_T::ReallocWithinBlock(VmaBlockVector& vector, VmaDeviceMemoryBlock* block) -{ - VmaBlockMetadata* metadata = block->m_pMetadata; - - for (VmaAllocHandle handle = metadata->GetAllocationListBegin(); - handle != VK_NULL_HANDLE; - handle = metadata->GetNextAllocation(handle)) - { - MoveAllocationData moveData = GetMoveData(handle, metadata); - // Ignore newly created allocations by defragmentation algorithm - if (moveData.move.srcAllocation->GetUserData() == this) - continue; - switch (CheckCounters(moveData.move.srcAllocation->GetSize())) - { - case CounterStatus::Ignore: - continue; - case CounterStatus::End: - return true; - default: - VMA_ASSERT(0); - case CounterStatus::Pass: - break; - } - - VkDeviceSize offset = moveData.move.srcAllocation->GetOffset(); - if (offset != 0 && metadata->GetSumFreeSize() >= moveData.size) - { - VmaAllocationRequest request = {}; - if (metadata->CreateAllocationRequest( - moveData.size, - moveData.alignment, - false, - moveData.type, - VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT, - &request)) - { - if (metadata->GetAllocationOffset(request.allocHandle) < offset) - { - if (vector.CommitAllocationRequest( - request, - block, - moveData.alignment, - moveData.flags, - this, - moveData.type, - &moveData.move.dstTmpAllocation) == VK_SUCCESS) - { - m_Moves.push_back(moveData.move); - if (IncrementCounters(moveData.size)) - return true; - } - } - } - } - } - return false; -} - -bool VmaDefragmentationContext_T::AllocInOtherBlock(size_t start, size_t end, MoveAllocationData& data, VmaBlockVector& vector) -{ - for (; start < end; ++start) - { - VmaDeviceMemoryBlock* dstBlock = vector.GetBlock(start); - if (dstBlock->m_pMetadata->GetSumFreeSize() >= data.size) - { - if (vector.AllocateFromBlock(dstBlock, - data.size, - data.alignment, - data.flags, - this, - data.type, - 0, - 
&data.move.dstTmpAllocation) == VK_SUCCESS) - { - m_Moves.push_back(data.move); - if (IncrementCounters(data.size)) - return true; - break; - } - } - } - return false; -} - -bool VmaDefragmentationContext_T::ComputeDefragmentation_Fast(VmaBlockVector& vector) -{ - // Move only between blocks - - // Go through allocations in last blocks and try to fit them inside first ones - for (size_t i = vector.GetBlockCount() - 1; i > m_ImmovableBlockCount; --i) - { - VmaBlockMetadata* metadata = vector.GetBlock(i)->m_pMetadata; - - for (VmaAllocHandle handle = metadata->GetAllocationListBegin(); - handle != VK_NULL_HANDLE; - handle = metadata->GetNextAllocation(handle)) - { - MoveAllocationData moveData = GetMoveData(handle, metadata); - // Ignore newly created allocations by defragmentation algorithm - if (moveData.move.srcAllocation->GetUserData() == this) - continue; - switch (CheckCounters(moveData.move.srcAllocation->GetSize())) - { - case CounterStatus::Ignore: - continue; - case CounterStatus::End: - return true; - default: - VMA_ASSERT(0); - case CounterStatus::Pass: - break; - } - - // Check all previous blocks for free space - if (AllocInOtherBlock(0, i, moveData, vector)) - return true; - } - } - return false; -} - -bool VmaDefragmentationContext_T::ComputeDefragmentation_Balanced(VmaBlockVector& vector, size_t index, bool update) -{ - // Go over every allocation and try to fit it in previous blocks at lowest offsets, - // if not possible: realloc within single block to minimize offset (exclude offset == 0), - // but only if there are noticable gaps between them (some heuristic, ex. average size of allocation in block) - VMA_ASSERT(m_AlgorithmState != VMA_NULL); - - StateBalanced& vectorState = reinterpret_cast(m_AlgorithmState)[index]; - if (update && vectorState.avgAllocSize == UINT64_MAX) - UpdateVectorStatistics(vector, vectorState); - - const size_t startMoveCount = m_Moves.size(); - VkDeviceSize minimalFreeRegion = vectorState.avgFreeSize / 2; - for (size_t i = vector.GetBlockCount() - 1; i > m_ImmovableBlockCount; --i) - { - VmaDeviceMemoryBlock* block = vector.GetBlock(i); - VmaBlockMetadata* metadata = block->m_pMetadata; - VkDeviceSize prevFreeRegionSize = 0; - - for (VmaAllocHandle handle = metadata->GetAllocationListBegin(); - handle != VK_NULL_HANDLE; - handle = metadata->GetNextAllocation(handle)) - { - MoveAllocationData moveData = GetMoveData(handle, metadata); - // Ignore newly created allocations by defragmentation algorithm - if (moveData.move.srcAllocation->GetUserData() == this) - continue; - switch (CheckCounters(moveData.move.srcAllocation->GetSize())) - { - case CounterStatus::Ignore: - continue; - case CounterStatus::End: - return true; - default: - VMA_ASSERT(0); - case CounterStatus::Pass: - break; - } - - // Check all previous blocks for free space - const size_t prevMoveCount = m_Moves.size(); - if (AllocInOtherBlock(0, i, moveData, vector)) - return true; - - VkDeviceSize nextFreeRegionSize = metadata->GetNextFreeRegionSize(handle); - // If no room found then realloc within block for lower offset - VkDeviceSize offset = moveData.move.srcAllocation->GetOffset(); - if (prevMoveCount == m_Moves.size() && offset != 0 && metadata->GetSumFreeSize() >= moveData.size) - { - // Check if realloc will make sense - if (prevFreeRegionSize >= minimalFreeRegion || - nextFreeRegionSize >= minimalFreeRegion || - moveData.size <= vectorState.avgFreeSize || - moveData.size <= vectorState.avgAllocSize) - { - VmaAllocationRequest request = {}; - if 
(metadata->CreateAllocationRequest( - moveData.size, - moveData.alignment, - false, - moveData.type, - VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT, - &request)) - { - if (metadata->GetAllocationOffset(request.allocHandle) < offset) - { - if (vector.CommitAllocationRequest( - request, - block, - moveData.alignment, - moveData.flags, - this, - moveData.type, - &moveData.move.dstTmpAllocation) == VK_SUCCESS) - { - m_Moves.push_back(moveData.move); - if (IncrementCounters(moveData.size)) - return true; - } - } - } - } - } - prevFreeRegionSize = nextFreeRegionSize; - } - } - - // No moves perfomed, update statistics to current vector state - if (startMoveCount == m_Moves.size() && !update) - { - vectorState.avgAllocSize = UINT64_MAX; - return ComputeDefragmentation_Balanced(vector, index, false); - } - return false; -} - -bool VmaDefragmentationContext_T::ComputeDefragmentation_Full(VmaBlockVector& vector) -{ - // Go over every allocation and try to fit it in previous blocks at lowest offsets, - // if not possible: realloc within single block to minimize offset (exclude offset == 0) - - for (size_t i = vector.GetBlockCount() - 1; i > m_ImmovableBlockCount; --i) - { - VmaDeviceMemoryBlock* block = vector.GetBlock(i); - VmaBlockMetadata* metadata = block->m_pMetadata; - - for (VmaAllocHandle handle = metadata->GetAllocationListBegin(); - handle != VK_NULL_HANDLE; - handle = metadata->GetNextAllocation(handle)) - { - MoveAllocationData moveData = GetMoveData(handle, metadata); - // Ignore newly created allocations by defragmentation algorithm - if (moveData.move.srcAllocation->GetUserData() == this) - continue; - switch (CheckCounters(moveData.move.srcAllocation->GetSize())) - { - case CounterStatus::Ignore: - continue; - case CounterStatus::End: - return true; - default: - VMA_ASSERT(0); - case CounterStatus::Pass: - break; - } - - // Check all previous blocks for free space - const size_t prevMoveCount = m_Moves.size(); - if (AllocInOtherBlock(0, i, moveData, vector)) - return true; - - // If no room found then realloc within block for lower offset - VkDeviceSize offset = moveData.move.srcAllocation->GetOffset(); - if (prevMoveCount == m_Moves.size() && offset != 0 && metadata->GetSumFreeSize() >= moveData.size) - { - VmaAllocationRequest request = {}; - if (metadata->CreateAllocationRequest( - moveData.size, - moveData.alignment, - false, - moveData.type, - VMA_ALLOCATION_CREATE_STRATEGY_MIN_OFFSET_BIT, - &request)) - { - if (metadata->GetAllocationOffset(request.allocHandle) < offset) - { - if (vector.CommitAllocationRequest( - request, - block, - moveData.alignment, - moveData.flags, - this, - moveData.type, - &moveData.move.dstTmpAllocation) == VK_SUCCESS) - { - m_Moves.push_back(moveData.move); - if (IncrementCounters(moveData.size)) - return true; - } - } - } - } - } - } - return false; -} - -bool VmaDefragmentationContext_T::ComputeDefragmentation_Extensive(VmaBlockVector& vector, size_t index) -{ - // First free single block, then populate it to the brim, then free another block, and so on - - // Fallback to previous algorithm since without granularity conflicts it can achieve max packing - if (vector.m_BufferImageGranularity == 1) - return ComputeDefragmentation_Full(vector); - - VMA_ASSERT(m_AlgorithmState != VMA_NULL); - - StateExtensive& vectorState = reinterpret_cast(m_AlgorithmState)[index]; - - bool texturePresent = false, bufferPresent = false, otherPresent = false; - switch (vectorState.operation) - { - case StateExtensive::Operation::Done: // Vector defragmented - return 
false; - case StateExtensive::Operation::FindFreeBlockBuffer: - case StateExtensive::Operation::FindFreeBlockTexture: - case StateExtensive::Operation::FindFreeBlockAll: - { - // No more blocks to free, just perform fast realloc and move to cleanup - if (vectorState.firstFreeBlock == 0) - { - vectorState.operation = StateExtensive::Operation::Cleanup; - return ComputeDefragmentation_Fast(vector); - } - - // No free blocks, have to clear last one - size_t last = (vectorState.firstFreeBlock == SIZE_MAX ? vector.GetBlockCount() : vectorState.firstFreeBlock) - 1; - VmaBlockMetadata* freeMetadata = vector.GetBlock(last)->m_pMetadata; - - const size_t prevMoveCount = m_Moves.size(); - for (VmaAllocHandle handle = freeMetadata->GetAllocationListBegin(); - handle != VK_NULL_HANDLE; - handle = freeMetadata->GetNextAllocation(handle)) - { - MoveAllocationData moveData = GetMoveData(handle, freeMetadata); - switch (CheckCounters(moveData.move.srcAllocation->GetSize())) - { - case CounterStatus::Ignore: - continue; - case CounterStatus::End: - return true; - default: - VMA_ASSERT(0); - case CounterStatus::Pass: - break; - } - - // Check all previous blocks for free space - if (AllocInOtherBlock(0, last, moveData, vector)) - { - // Full clear performed already - if (prevMoveCount != m_Moves.size() && freeMetadata->GetNextAllocation(handle) == VK_NULL_HANDLE) - reinterpret_cast(m_AlgorithmState)[index] = last; - return true; - } - } - - if (prevMoveCount == m_Moves.size()) - { - // Cannot perform full clear, have to move data in other blocks around - if (last != 0) - { - for (size_t i = last - 1; i; --i) - { - if (ReallocWithinBlock(vector, vector.GetBlock(i))) - return true; - } - } - - if (prevMoveCount == m_Moves.size()) - { - // No possible reallocs within blocks, try to move them around fast - return ComputeDefragmentation_Fast(vector); - } - } - else - { - switch (vectorState.operation) - { - case StateExtensive::Operation::FindFreeBlockBuffer: - vectorState.operation = StateExtensive::Operation::MoveBuffers; - break; - default: - VMA_ASSERT(0); - case StateExtensive::Operation::FindFreeBlockTexture: - vectorState.operation = StateExtensive::Operation::MoveTextures; - break; - case StateExtensive::Operation::FindFreeBlockAll: - vectorState.operation = StateExtensive::Operation::MoveAll; - break; - } - vectorState.firstFreeBlock = last; - // Nothing done, block found without reallocations, can perform another reallocs in same pass - return ComputeDefragmentation_Extensive(vector, index); - } - break; - } - case StateExtensive::Operation::MoveTextures: - { - if (MoveDataToFreeBlocks(VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL, vector, - vectorState.firstFreeBlock, texturePresent, bufferPresent, otherPresent)) - { - if (texturePresent) - { - vectorState.operation = StateExtensive::Operation::FindFreeBlockTexture; - return ComputeDefragmentation_Extensive(vector, index); - } - - if (!bufferPresent && !otherPresent) - { - vectorState.operation = StateExtensive::Operation::Cleanup; - break; - } - - // No more textures to move, check buffers - vectorState.operation = StateExtensive::Operation::MoveBuffers; - bufferPresent = false; - otherPresent = false; - } - else - break; - } - case StateExtensive::Operation::MoveBuffers: - { - if (MoveDataToFreeBlocks(VMA_SUBALLOCATION_TYPE_BUFFER, vector, - vectorState.firstFreeBlock, texturePresent, bufferPresent, otherPresent)) - { - if (bufferPresent) - { - vectorState.operation = StateExtensive::Operation::FindFreeBlockBuffer; - return 
ComputeDefragmentation_Extensive(vector, index); - } - - if (!otherPresent) - { - vectorState.operation = StateExtensive::Operation::Cleanup; - break; - } - - // No more buffers to move, check all others - vectorState.operation = StateExtensive::Operation::MoveAll; - otherPresent = false; - } - else - break; - } - case StateExtensive::Operation::MoveAll: - { - if (MoveDataToFreeBlocks(VMA_SUBALLOCATION_TYPE_FREE, vector, - vectorState.firstFreeBlock, texturePresent, bufferPresent, otherPresent)) - { - if (otherPresent) - { - vectorState.operation = StateExtensive::Operation::FindFreeBlockBuffer; - return ComputeDefragmentation_Extensive(vector, index); - } - // Everything moved - vectorState.operation = StateExtensive::Operation::Cleanup; - } - break; - } - case StateExtensive::Operation::Cleanup: - // Cleanup is handled below so that other operations may reuse the cleanup code. This case is here to prevent the unhandled enum value warning (C4062). - break; - } - - if (vectorState.operation == StateExtensive::Operation::Cleanup) - { - // All other work done, pack data in blocks even tighter if possible - const size_t prevMoveCount = m_Moves.size(); - for (size_t i = 0; i < vector.GetBlockCount(); ++i) - { - if (ReallocWithinBlock(vector, vector.GetBlock(i))) - return true; - } - - if (prevMoveCount == m_Moves.size()) - vectorState.operation = StateExtensive::Operation::Done; - } - return false; -} - -void VmaDefragmentationContext_T::UpdateVectorStatistics(VmaBlockVector& vector, StateBalanced& state) -{ - size_t allocCount = 0; - size_t freeCount = 0; - state.avgFreeSize = 0; - state.avgAllocSize = 0; - - for (size_t i = 0; i < vector.GetBlockCount(); ++i) - { - VmaBlockMetadata* metadata = vector.GetBlock(i)->m_pMetadata; - - allocCount += metadata->GetAllocationCount(); - freeCount += metadata->GetFreeRegionsCount(); - state.avgFreeSize += metadata->GetSumFreeSize(); - state.avgAllocSize += metadata->GetSize(); - } - - state.avgAllocSize = (state.avgAllocSize - state.avgFreeSize) / allocCount; - state.avgFreeSize /= freeCount; -} - -bool VmaDefragmentationContext_T::MoveDataToFreeBlocks(VmaSuballocationType currentType, - VmaBlockVector& vector, size_t firstFreeBlock, - bool& texturePresent, bool& bufferPresent, bool& otherPresent) -{ - const size_t prevMoveCount = m_Moves.size(); - for (size_t i = firstFreeBlock ; i;) - { - VmaDeviceMemoryBlock* block = vector.GetBlock(--i); - VmaBlockMetadata* metadata = block->m_pMetadata; - - for (VmaAllocHandle handle = metadata->GetAllocationListBegin(); - handle != VK_NULL_HANDLE; - handle = metadata->GetNextAllocation(handle)) - { - MoveAllocationData moveData = GetMoveData(handle, metadata); - // Ignore newly created allocations by defragmentation algorithm - if (moveData.move.srcAllocation->GetUserData() == this) - continue; - switch (CheckCounters(moveData.move.srcAllocation->GetSize())) - { - case CounterStatus::Ignore: - continue; - case CounterStatus::End: - return true; - default: - VMA_ASSERT(0); - case CounterStatus::Pass: - break; - } - - // Move only single type of resources at once - if (!VmaIsBufferImageGranularityConflict(moveData.type, currentType)) - { - // Try to fit allocation into free blocks - if (AllocInOtherBlock(firstFreeBlock, vector.GetBlockCount(), moveData, vector)) - return false; - } - - if (!VmaIsBufferImageGranularityConflict(moveData.type, VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL)) - texturePresent = true; - else if (!VmaIsBufferImageGranularityConflict(moveData.type, VMA_SUBALLOCATION_TYPE_BUFFER)) - bufferPresent = 
true; - else - otherPresent = true; - } - } - return prevMoveCount == m_Moves.size(); -} -#endif // _VMA_DEFRAGMENTATION_CONTEXT_FUNCTIONS - -#ifndef _VMA_POOL_T_FUNCTIONS -VmaPool_T::VmaPool_T( - VmaAllocator hAllocator, - const VmaPoolCreateInfo& createInfo, - VkDeviceSize preferredBlockSize) - : m_BlockVector( - hAllocator, - this, // hParentPool - createInfo.memoryTypeIndex, - createInfo.blockSize != 0 ? createInfo.blockSize : preferredBlockSize, - createInfo.minBlockCount, - createInfo.maxBlockCount, - (createInfo.flags& VMA_POOL_CREATE_IGNORE_BUFFER_IMAGE_GRANULARITY_BIT) != 0 ? 1 : hAllocator->GetBufferImageGranularity(), - createInfo.blockSize != 0, // explicitBlockSize - createInfo.flags & VMA_POOL_CREATE_ALGORITHM_MASK, // algorithm - createInfo.priority, - VMA_MAX(hAllocator->GetMemoryTypeMinAlignment(createInfo.memoryTypeIndex), createInfo.minAllocationAlignment), - createInfo.pMemoryAllocateNext), - m_Id(0), - m_Name(VMA_NULL) {} - -VmaPool_T::~VmaPool_T() -{ - VMA_ASSERT(m_PrevPool == VMA_NULL && m_NextPool == VMA_NULL); -} - -void VmaPool_T::SetName(const char* pName) -{ - const VkAllocationCallbacks* allocs = m_BlockVector.GetAllocator()->GetAllocationCallbacks(); - VmaFreeString(allocs, m_Name); - - if (pName != VMA_NULL) - { - m_Name = VmaCreateStringCopy(allocs, pName); - } - else - { - m_Name = VMA_NULL; - } -} -#endif // _VMA_POOL_T_FUNCTIONS - -#ifndef _VMA_ALLOCATOR_T_FUNCTIONS -VmaAllocator_T::VmaAllocator_T(const VmaAllocatorCreateInfo* pCreateInfo) : - m_UseMutex((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_EXTERNALLY_SYNCHRONIZED_BIT) == 0), - m_VulkanApiVersion(pCreateInfo->vulkanApiVersion != 0 ? pCreateInfo->vulkanApiVersion : VK_API_VERSION_1_0), - m_UseKhrDedicatedAllocation((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT) != 0), - m_UseKhrBindMemory2((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_KHR_BIND_MEMORY2_BIT) != 0), - m_UseExtMemoryBudget((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT) != 0), - m_UseAmdDeviceCoherentMemory((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_AMD_DEVICE_COHERENT_MEMORY_BIT) != 0), - m_UseKhrBufferDeviceAddress((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT) != 0), - m_UseExtMemoryPriority((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT) != 0), - m_hDevice(pCreateInfo->device), - m_hInstance(pCreateInfo->instance), - m_AllocationCallbacksSpecified(pCreateInfo->pAllocationCallbacks != VMA_NULL), - m_AllocationCallbacks(pCreateInfo->pAllocationCallbacks ? - *pCreateInfo->pAllocationCallbacks : VmaEmptyAllocationCallbacks), - m_AllocationObjectAllocator(&m_AllocationCallbacks), - m_HeapSizeLimitMask(0), - m_DeviceMemoryCount(0), - m_PreferredLargeHeapBlockSize(0), - m_PhysicalDevice(pCreateInfo->physicalDevice), - m_GpuDefragmentationMemoryTypeBits(UINT32_MAX), - m_NextPoolId(0), - m_GlobalMemoryTypeBits(UINT32_MAX) -{ - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - m_UseKhrDedicatedAllocation = false; - m_UseKhrBindMemory2 = false; - } - - if(VMA_DEBUG_DETECT_CORRUPTION) - { - // Needs to be multiply of uint32_t size because we are going to write VMA_CORRUPTION_DETECTION_MAGIC_VALUE to it. 
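DefragmentPassBegin() and DefragmentPassEnd() above are driven from the application as a begin/pass/end loop. The sketch below assumes the usual public entry points vmaBeginDefragmentation, vmaBeginDefragmentationPass, vmaEndDefragmentationPass and vmaEndDefragmentation (none of them appear in this hunk) and elides the actual data copies between source and destination allocations:

```
// Sketch of a defragmentation driver loop; the vmaBegin*/vmaEnd* entry points are
// assumed, and the per-move data copies are elided.
#include "vk_mem_alloc.h"

void DefragmentAll(VmaAllocator allocator)
{
    VmaDefragmentationInfo defragInfo = {};
    defragInfo.flags = VMA_DEFRAGMENTATION_FLAG_ALGORITHM_BALANCED_BIT;
    defragInfo.maxBytesPerPass = 0;        // 0 = unlimited, becomes VK_WHOLE_SIZE above
    defragInfo.maxAllocationsPerPass = 0;  // 0 = unlimited, becomes UINT32_MAX above

    VmaDefragmentationContext defragCtx = VK_NULL_HANDLE;
    if (vmaBeginDefragmentation(allocator, &defragInfo, &defragCtx) != VK_SUCCESS)
        return;

    for (;;)
    {
        VmaDefragmentationPassMoveInfo pass = {};
        VkResult res = vmaBeginDefragmentationPass(allocator, defragCtx, &pass);
        if (res == VK_SUCCESS)
            break;                         // nothing left to move
        // res == VK_INCOMPLETE: copy each pass.pMoves[i].srcAllocation into its
        // dstTmpAllocation (e.g. vkCmdCopyBuffer + submit + wait), or set
        // pass.pMoves[i].operation to IGNORE / DESTROY, then end the pass.
        res = vmaEndDefragmentationPass(allocator, defragCtx, &pass);
        if (res == VK_SUCCESS)
            break;
    }

    VmaDefragmentationStats stats = {};
    vmaEndDefragmentation(allocator, defragCtx, &stats);
}
```

VK_INCOMPLETE from the pass functions signals that more work remains, matching the return values produced in the code above.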
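VmaPool_T above is configured entirely from a VmaPoolCreateInfo, so on the application side this corresponds to creating a custom pool. The sketch assumes the usual public helpers vmaFindMemoryTypeIndexForBufferInfo and vmaCreatePool (not part of this hunk) and uses arbitrary sizes:

```
// Sketch of creating a custom pool; the vmaFindMemoryTypeIndexForBufferInfo and
// vmaCreatePool entry points are assumed, and all sizes are arbitrary.
#include "vk_mem_alloc.h"

VkResult CreateLinearUploadPool(VmaAllocator allocator, VmaPool* pPool)
{
    // Pick a memory type the same way a buffer allocation would.
    VkBufferCreateInfo sampleBufInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
    sampleBufInfo.size  = 1024;
    sampleBufInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;

    VmaAllocationCreateInfo sampleAllocInfo = {};
    sampleAllocInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;
    sampleAllocInfo.usage = VMA_MEMORY_USAGE_AUTO;

    uint32_t memTypeIndex = 0;
    VkResult res = vmaFindMemoryTypeIndexForBufferInfo(
        allocator, &sampleBufInfo, &sampleAllocInfo, &memTypeIndex);
    if (res != VK_SUCCESS)
        return res;

    VmaPoolCreateInfo poolInfo = {};
    poolInfo.memoryTypeIndex = memTypeIndex;
    poolInfo.flags = VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT;  // ring-buffer style pool
    poolInfo.blockSize = 64ull << 20;   // explicit 64 MiB blocks (0 would mean "preferred size")
    poolInfo.minBlockCount = 1;         // keep one block alive permanently
    poolInfo.maxBlockCount = 4;
    poolInfo.priority = 0.5f;           // only used with VK_EXT_memory_priority

    return vmaCreatePool(allocator, &poolInfo, pPool);
}
```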
- VMA_ASSERT(VMA_DEBUG_MARGIN % sizeof(uint32_t) == 0); - } - - VMA_ASSERT(pCreateInfo->physicalDevice && pCreateInfo->device && pCreateInfo->instance); - - if(m_VulkanApiVersion < VK_MAKE_VERSION(1, 1, 0)) - { -#if !(VMA_DEDICATED_ALLOCATION) - if((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT) != 0) - { - VMA_ASSERT(0 && "VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT set but required extensions are disabled by preprocessor macros."); - } -#endif -#if !(VMA_BIND_MEMORY2) - if((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_KHR_BIND_MEMORY2_BIT) != 0) - { - VMA_ASSERT(0 && "VMA_ALLOCATOR_CREATE_KHR_BIND_MEMORY2_BIT set but required extension is disabled by preprocessor macros."); - } -#endif - } -#if !(VMA_MEMORY_BUDGET) - if((pCreateInfo->flags & VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT) != 0) - { - VMA_ASSERT(0 && "VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT set but required extension is disabled by preprocessor macros."); - } -#endif -#if !(VMA_BUFFER_DEVICE_ADDRESS) - if(m_UseKhrBufferDeviceAddress) - { - VMA_ASSERT(0 && "VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT is set but required extension or Vulkan 1.2 is not available in your Vulkan header or its support in VMA has been disabled by a preprocessor macro."); - } -#endif -#if VMA_VULKAN_VERSION < 1002000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 2, 0)) - { - VMA_ASSERT(0 && "vulkanApiVersion >= VK_API_VERSION_1_2 but required Vulkan version is disabled by preprocessor macros."); - } -#endif -#if VMA_VULKAN_VERSION < 1001000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - VMA_ASSERT(0 && "vulkanApiVersion >= VK_API_VERSION_1_1 but required Vulkan version is disabled by preprocessor macros."); - } -#endif -#if !(VMA_MEMORY_PRIORITY) - if(m_UseExtMemoryPriority) - { - VMA_ASSERT(0 && "VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT is set but required extension is not available in your Vulkan header or its support in VMA has been disabled by a preprocessor macro."); - } -#endif - - memset(&m_DeviceMemoryCallbacks, 0 ,sizeof(m_DeviceMemoryCallbacks)); - memset(&m_PhysicalDeviceProperties, 0, sizeof(m_PhysicalDeviceProperties)); - memset(&m_MemProps, 0, sizeof(m_MemProps)); - - memset(&m_pBlockVectors, 0, sizeof(m_pBlockVectors)); - memset(&m_VulkanFunctions, 0, sizeof(m_VulkanFunctions)); - -#if VMA_EXTERNAL_MEMORY - memset(&m_TypeExternalMemoryHandleTypes, 0, sizeof(m_TypeExternalMemoryHandleTypes)); -#endif // #if VMA_EXTERNAL_MEMORY - - if(pCreateInfo->pDeviceMemoryCallbacks != VMA_NULL) - { - m_DeviceMemoryCallbacks.pUserData = pCreateInfo->pDeviceMemoryCallbacks->pUserData; - m_DeviceMemoryCallbacks.pfnAllocate = pCreateInfo->pDeviceMemoryCallbacks->pfnAllocate; - m_DeviceMemoryCallbacks.pfnFree = pCreateInfo->pDeviceMemoryCallbacks->pfnFree; - } - - ImportVulkanFunctions(pCreateInfo->pVulkanFunctions); - - (*m_VulkanFunctions.vkGetPhysicalDeviceProperties)(m_PhysicalDevice, &m_PhysicalDeviceProperties); - (*m_VulkanFunctions.vkGetPhysicalDeviceMemoryProperties)(m_PhysicalDevice, &m_MemProps); - - VMA_ASSERT(VmaIsPow2(VMA_MIN_ALIGNMENT)); - VMA_ASSERT(VmaIsPow2(VMA_DEBUG_MIN_BUFFER_IMAGE_GRANULARITY)); - VMA_ASSERT(VmaIsPow2(m_PhysicalDeviceProperties.limits.bufferImageGranularity)); - VMA_ASSERT(VmaIsPow2(m_PhysicalDeviceProperties.limits.nonCoherentAtomSize)); - - m_PreferredLargeHeapBlockSize = (pCreateInfo->preferredLargeHeapBlockSize != 0) ? 
- pCreateInfo->preferredLargeHeapBlockSize : static_cast(VMA_DEFAULT_LARGE_HEAP_BLOCK_SIZE); - - m_GlobalMemoryTypeBits = CalculateGlobalMemoryTypeBits(); - -#if VMA_EXTERNAL_MEMORY - if(pCreateInfo->pTypeExternalMemoryHandleTypes != VMA_NULL) - { - memcpy(m_TypeExternalMemoryHandleTypes, pCreateInfo->pTypeExternalMemoryHandleTypes, - sizeof(VkExternalMemoryHandleTypeFlagsKHR) * GetMemoryTypeCount()); - } -#endif // #if VMA_EXTERNAL_MEMORY - - if(pCreateInfo->pHeapSizeLimit != VMA_NULL) - { - for(uint32_t heapIndex = 0; heapIndex < GetMemoryHeapCount(); ++heapIndex) - { - const VkDeviceSize limit = pCreateInfo->pHeapSizeLimit[heapIndex]; - if(limit != VK_WHOLE_SIZE) - { - m_HeapSizeLimitMask |= 1u << heapIndex; - if(limit < m_MemProps.memoryHeaps[heapIndex].size) - { - m_MemProps.memoryHeaps[heapIndex].size = limit; - } - } - } - } - - for(uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - // Create only supported types - if((m_GlobalMemoryTypeBits & (1u << memTypeIndex)) != 0) - { - const VkDeviceSize preferredBlockSize = CalcPreferredBlockSize(memTypeIndex); - m_pBlockVectors[memTypeIndex] = vma_new(this, VmaBlockVector)( - this, - VK_NULL_HANDLE, // hParentPool - memTypeIndex, - preferredBlockSize, - 0, - SIZE_MAX, - GetBufferImageGranularity(), - false, // explicitBlockSize - 0, // algorithm - 0.5f, // priority (0.5 is the default per Vulkan spec) - GetMemoryTypeMinAlignment(memTypeIndex), // minAllocationAlignment - VMA_NULL); // // pMemoryAllocateNext - // No need to call m_pBlockVectors[memTypeIndex][blockVectorTypeIndex]->CreateMinBlocks here, - // becase minBlockCount is 0. - } - } -} - -VkResult VmaAllocator_T::Init(const VmaAllocatorCreateInfo* pCreateInfo) -{ - VkResult res = VK_SUCCESS; - -#if VMA_MEMORY_BUDGET - if(m_UseExtMemoryBudget) - { - UpdateVulkanBudget(); - } -#endif // #if VMA_MEMORY_BUDGET - - return res; -} - -VmaAllocator_T::~VmaAllocator_T() -{ - VMA_ASSERT(m_Pools.IsEmpty()); - - for(size_t memTypeIndex = GetMemoryTypeCount(); memTypeIndex--; ) - { - vma_delete(this, m_pBlockVectors[memTypeIndex]); - } -} - -void VmaAllocator_T::ImportVulkanFunctions(const VmaVulkanFunctions* pVulkanFunctions) -{ -#if VMA_STATIC_VULKAN_FUNCTIONS == 1 - ImportVulkanFunctions_Static(); -#endif - - if(pVulkanFunctions != VMA_NULL) - { - ImportVulkanFunctions_Custom(pVulkanFunctions); - } - -#if VMA_DYNAMIC_VULKAN_FUNCTIONS == 1 - ImportVulkanFunctions_Dynamic(); -#endif - - ValidateVulkanFunctions(); -} - -#if VMA_STATIC_VULKAN_FUNCTIONS == 1 - -void VmaAllocator_T::ImportVulkanFunctions_Static() -{ - // Vulkan 1.0 - m_VulkanFunctions.vkGetInstanceProcAddr = (PFN_vkGetInstanceProcAddr)vkGetInstanceProcAddr; - m_VulkanFunctions.vkGetDeviceProcAddr = (PFN_vkGetDeviceProcAddr)vkGetDeviceProcAddr; - m_VulkanFunctions.vkGetPhysicalDeviceProperties = (PFN_vkGetPhysicalDeviceProperties)vkGetPhysicalDeviceProperties; - m_VulkanFunctions.vkGetPhysicalDeviceMemoryProperties = (PFN_vkGetPhysicalDeviceMemoryProperties)vkGetPhysicalDeviceMemoryProperties; - m_VulkanFunctions.vkAllocateMemory = (PFN_vkAllocateMemory)vkAllocateMemory; - m_VulkanFunctions.vkFreeMemory = (PFN_vkFreeMemory)vkFreeMemory; - m_VulkanFunctions.vkMapMemory = (PFN_vkMapMemory)vkMapMemory; - m_VulkanFunctions.vkUnmapMemory = (PFN_vkUnmapMemory)vkUnmapMemory; - m_VulkanFunctions.vkFlushMappedMemoryRanges = (PFN_vkFlushMappedMemoryRanges)vkFlushMappedMemoryRanges; - m_VulkanFunctions.vkInvalidateMappedMemoryRanges = (PFN_vkInvalidateMappedMemoryRanges)vkInvalidateMappedMemoryRanges; - 
m_VulkanFunctions.vkBindBufferMemory = (PFN_vkBindBufferMemory)vkBindBufferMemory; - m_VulkanFunctions.vkBindImageMemory = (PFN_vkBindImageMemory)vkBindImageMemory; - m_VulkanFunctions.vkGetBufferMemoryRequirements = (PFN_vkGetBufferMemoryRequirements)vkGetBufferMemoryRequirements; - m_VulkanFunctions.vkGetImageMemoryRequirements = (PFN_vkGetImageMemoryRequirements)vkGetImageMemoryRequirements; - m_VulkanFunctions.vkCreateBuffer = (PFN_vkCreateBuffer)vkCreateBuffer; - m_VulkanFunctions.vkDestroyBuffer = (PFN_vkDestroyBuffer)vkDestroyBuffer; - m_VulkanFunctions.vkCreateImage = (PFN_vkCreateImage)vkCreateImage; - m_VulkanFunctions.vkDestroyImage = (PFN_vkDestroyImage)vkDestroyImage; - m_VulkanFunctions.vkCmdCopyBuffer = (PFN_vkCmdCopyBuffer)vkCmdCopyBuffer; - - // Vulkan 1.1 -#if VMA_VULKAN_VERSION >= 1001000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - m_VulkanFunctions.vkGetBufferMemoryRequirements2KHR = (PFN_vkGetBufferMemoryRequirements2)vkGetBufferMemoryRequirements2; - m_VulkanFunctions.vkGetImageMemoryRequirements2KHR = (PFN_vkGetImageMemoryRequirements2)vkGetImageMemoryRequirements2; - m_VulkanFunctions.vkBindBufferMemory2KHR = (PFN_vkBindBufferMemory2)vkBindBufferMemory2; - m_VulkanFunctions.vkBindImageMemory2KHR = (PFN_vkBindImageMemory2)vkBindImageMemory2; - m_VulkanFunctions.vkGetPhysicalDeviceMemoryProperties2KHR = (PFN_vkGetPhysicalDeviceMemoryProperties2)vkGetPhysicalDeviceMemoryProperties2; - } -#endif - -#if VMA_VULKAN_VERSION >= 1003000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 3, 0)) - { - m_VulkanFunctions.vkGetDeviceBufferMemoryRequirements = (PFN_vkGetDeviceBufferMemoryRequirements)vkGetDeviceBufferMemoryRequirements; - m_VulkanFunctions.vkGetDeviceImageMemoryRequirements = (PFN_vkGetDeviceImageMemoryRequirements)vkGetDeviceImageMemoryRequirements; - } -#endif -} - -#endif // VMA_STATIC_VULKAN_FUNCTIONS == 1 - -void VmaAllocator_T::ImportVulkanFunctions_Custom(const VmaVulkanFunctions* pVulkanFunctions) -{ - VMA_ASSERT(pVulkanFunctions != VMA_NULL); - -#define VMA_COPY_IF_NOT_NULL(funcName) \ - if(pVulkanFunctions->funcName != VMA_NULL) m_VulkanFunctions.funcName = pVulkanFunctions->funcName; - - VMA_COPY_IF_NOT_NULL(vkGetInstanceProcAddr); - VMA_COPY_IF_NOT_NULL(vkGetDeviceProcAddr); - VMA_COPY_IF_NOT_NULL(vkGetPhysicalDeviceProperties); - VMA_COPY_IF_NOT_NULL(vkGetPhysicalDeviceMemoryProperties); - VMA_COPY_IF_NOT_NULL(vkAllocateMemory); - VMA_COPY_IF_NOT_NULL(vkFreeMemory); - VMA_COPY_IF_NOT_NULL(vkMapMemory); - VMA_COPY_IF_NOT_NULL(vkUnmapMemory); - VMA_COPY_IF_NOT_NULL(vkFlushMappedMemoryRanges); - VMA_COPY_IF_NOT_NULL(vkInvalidateMappedMemoryRanges); - VMA_COPY_IF_NOT_NULL(vkBindBufferMemory); - VMA_COPY_IF_NOT_NULL(vkBindImageMemory); - VMA_COPY_IF_NOT_NULL(vkGetBufferMemoryRequirements); - VMA_COPY_IF_NOT_NULL(vkGetImageMemoryRequirements); - VMA_COPY_IF_NOT_NULL(vkCreateBuffer); - VMA_COPY_IF_NOT_NULL(vkDestroyBuffer); - VMA_COPY_IF_NOT_NULL(vkCreateImage); - VMA_COPY_IF_NOT_NULL(vkDestroyImage); - VMA_COPY_IF_NOT_NULL(vkCmdCopyBuffer); - -#if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - VMA_COPY_IF_NOT_NULL(vkGetBufferMemoryRequirements2KHR); - VMA_COPY_IF_NOT_NULL(vkGetImageMemoryRequirements2KHR); -#endif - -#if VMA_BIND_MEMORY2 || VMA_VULKAN_VERSION >= 1001000 - VMA_COPY_IF_NOT_NULL(vkBindBufferMemory2KHR); - VMA_COPY_IF_NOT_NULL(vkBindImageMemory2KHR); -#endif - -#if VMA_MEMORY_BUDGET - VMA_COPY_IF_NOT_NULL(vkGetPhysicalDeviceMemoryProperties2KHR); -#endif - -#if VMA_VULKAN_VERSION >= 1003000 - 
VMA_COPY_IF_NOT_NULL(vkGetDeviceBufferMemoryRequirements); - VMA_COPY_IF_NOT_NULL(vkGetDeviceImageMemoryRequirements); -#endif - -#undef VMA_COPY_IF_NOT_NULL -} - -#if VMA_DYNAMIC_VULKAN_FUNCTIONS == 1 - -void VmaAllocator_T::ImportVulkanFunctions_Dynamic() -{ - VMA_ASSERT(m_VulkanFunctions.vkGetInstanceProcAddr && m_VulkanFunctions.vkGetDeviceProcAddr && - "To use VMA_DYNAMIC_VULKAN_FUNCTIONS in new versions of VMA you now have to pass " - "VmaVulkanFunctions::vkGetInstanceProcAddr and vkGetDeviceProcAddr as VmaAllocatorCreateInfo::pVulkanFunctions. " - "Other members can be null."); - -#define VMA_FETCH_INSTANCE_FUNC(memberName, functionPointerType, functionNameString) \ - if(m_VulkanFunctions.memberName == VMA_NULL) \ - m_VulkanFunctions.memberName = \ - (functionPointerType)m_VulkanFunctions.vkGetInstanceProcAddr(m_hInstance, functionNameString); -#define VMA_FETCH_DEVICE_FUNC(memberName, functionPointerType, functionNameString) \ - if(m_VulkanFunctions.memberName == VMA_NULL) \ - m_VulkanFunctions.memberName = \ - (functionPointerType)m_VulkanFunctions.vkGetDeviceProcAddr(m_hDevice, functionNameString); - - VMA_FETCH_INSTANCE_FUNC(vkGetPhysicalDeviceProperties, PFN_vkGetPhysicalDeviceProperties, "vkGetPhysicalDeviceProperties"); - VMA_FETCH_INSTANCE_FUNC(vkGetPhysicalDeviceMemoryProperties, PFN_vkGetPhysicalDeviceMemoryProperties, "vkGetPhysicalDeviceMemoryProperties"); - VMA_FETCH_DEVICE_FUNC(vkAllocateMemory, PFN_vkAllocateMemory, "vkAllocateMemory"); - VMA_FETCH_DEVICE_FUNC(vkFreeMemory, PFN_vkFreeMemory, "vkFreeMemory"); - VMA_FETCH_DEVICE_FUNC(vkMapMemory, PFN_vkMapMemory, "vkMapMemory"); - VMA_FETCH_DEVICE_FUNC(vkUnmapMemory, PFN_vkUnmapMemory, "vkUnmapMemory"); - VMA_FETCH_DEVICE_FUNC(vkFlushMappedMemoryRanges, PFN_vkFlushMappedMemoryRanges, "vkFlushMappedMemoryRanges"); - VMA_FETCH_DEVICE_FUNC(vkInvalidateMappedMemoryRanges, PFN_vkInvalidateMappedMemoryRanges, "vkInvalidateMappedMemoryRanges"); - VMA_FETCH_DEVICE_FUNC(vkBindBufferMemory, PFN_vkBindBufferMemory, "vkBindBufferMemory"); - VMA_FETCH_DEVICE_FUNC(vkBindImageMemory, PFN_vkBindImageMemory, "vkBindImageMemory"); - VMA_FETCH_DEVICE_FUNC(vkGetBufferMemoryRequirements, PFN_vkGetBufferMemoryRequirements, "vkGetBufferMemoryRequirements"); - VMA_FETCH_DEVICE_FUNC(vkGetImageMemoryRequirements, PFN_vkGetImageMemoryRequirements, "vkGetImageMemoryRequirements"); - VMA_FETCH_DEVICE_FUNC(vkCreateBuffer, PFN_vkCreateBuffer, "vkCreateBuffer"); - VMA_FETCH_DEVICE_FUNC(vkDestroyBuffer, PFN_vkDestroyBuffer, "vkDestroyBuffer"); - VMA_FETCH_DEVICE_FUNC(vkCreateImage, PFN_vkCreateImage, "vkCreateImage"); - VMA_FETCH_DEVICE_FUNC(vkDestroyImage, PFN_vkDestroyImage, "vkDestroyImage"); - VMA_FETCH_DEVICE_FUNC(vkCmdCopyBuffer, PFN_vkCmdCopyBuffer, "vkCmdCopyBuffer"); - -#if VMA_VULKAN_VERSION >= 1001000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - VMA_FETCH_DEVICE_FUNC(vkGetBufferMemoryRequirements2KHR, PFN_vkGetBufferMemoryRequirements2, "vkGetBufferMemoryRequirements2"); - VMA_FETCH_DEVICE_FUNC(vkGetImageMemoryRequirements2KHR, PFN_vkGetImageMemoryRequirements2, "vkGetImageMemoryRequirements2"); - VMA_FETCH_DEVICE_FUNC(vkBindBufferMemory2KHR, PFN_vkBindBufferMemory2, "vkBindBufferMemory2"); - VMA_FETCH_DEVICE_FUNC(vkBindImageMemory2KHR, PFN_vkBindImageMemory2, "vkBindImageMemory2"); - VMA_FETCH_INSTANCE_FUNC(vkGetPhysicalDeviceMemoryProperties2KHR, PFN_vkGetPhysicalDeviceMemoryProperties2, "vkGetPhysicalDeviceMemoryProperties2"); - } -#endif - -#if VMA_DEDICATED_ALLOCATION - if(m_UseKhrDedicatedAllocation) - { - 
VMA_FETCH_DEVICE_FUNC(vkGetBufferMemoryRequirements2KHR, PFN_vkGetBufferMemoryRequirements2KHR, "vkGetBufferMemoryRequirements2KHR"); - VMA_FETCH_DEVICE_FUNC(vkGetImageMemoryRequirements2KHR, PFN_vkGetImageMemoryRequirements2KHR, "vkGetImageMemoryRequirements2KHR"); - } -#endif - -#if VMA_BIND_MEMORY2 - if(m_UseKhrBindMemory2) - { - VMA_FETCH_DEVICE_FUNC(vkBindBufferMemory2KHR, PFN_vkBindBufferMemory2KHR, "vkBindBufferMemory2KHR"); - VMA_FETCH_DEVICE_FUNC(vkBindImageMemory2KHR, PFN_vkBindImageMemory2KHR, "vkBindImageMemory2KHR"); - } -#endif // #if VMA_BIND_MEMORY2 - -#if VMA_MEMORY_BUDGET - if(m_UseExtMemoryBudget) - { - VMA_FETCH_INSTANCE_FUNC(vkGetPhysicalDeviceMemoryProperties2KHR, PFN_vkGetPhysicalDeviceMemoryProperties2KHR, "vkGetPhysicalDeviceMemoryProperties2KHR"); - } -#endif // #if VMA_MEMORY_BUDGET - -#if VMA_VULKAN_VERSION >= 1003000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 3, 0)) - { - VMA_FETCH_DEVICE_FUNC(vkGetDeviceBufferMemoryRequirements, PFN_vkGetDeviceBufferMemoryRequirements, "vkGetDeviceBufferMemoryRequirements"); - VMA_FETCH_DEVICE_FUNC(vkGetDeviceImageMemoryRequirements, PFN_vkGetDeviceImageMemoryRequirements, "vkGetDeviceImageMemoryRequirements"); - } -#endif - -#undef VMA_FETCH_DEVICE_FUNC -#undef VMA_FETCH_INSTANCE_FUNC -} - -#endif // VMA_DYNAMIC_VULKAN_FUNCTIONS == 1 - -void VmaAllocator_T::ValidateVulkanFunctions() -{ - VMA_ASSERT(m_VulkanFunctions.vkGetPhysicalDeviceProperties != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkGetPhysicalDeviceMemoryProperties != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkAllocateMemory != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkFreeMemory != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkMapMemory != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkUnmapMemory != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkFlushMappedMemoryRanges != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkInvalidateMappedMemoryRanges != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkBindBufferMemory != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkBindImageMemory != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkGetBufferMemoryRequirements != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkGetImageMemoryRequirements != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkCreateBuffer != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkDestroyBuffer != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkCreateImage != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkDestroyImage != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkCmdCopyBuffer != VMA_NULL); - -#if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0) || m_UseKhrDedicatedAllocation) - { - VMA_ASSERT(m_VulkanFunctions.vkGetBufferMemoryRequirements2KHR != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkGetImageMemoryRequirements2KHR != VMA_NULL); - } -#endif - -#if VMA_BIND_MEMORY2 || VMA_VULKAN_VERSION >= 1001000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0) || m_UseKhrBindMemory2) - { - VMA_ASSERT(m_VulkanFunctions.vkBindBufferMemory2KHR != VMA_NULL); - VMA_ASSERT(m_VulkanFunctions.vkBindImageMemory2KHR != VMA_NULL); - } -#endif - -#if VMA_MEMORY_BUDGET || VMA_VULKAN_VERSION >= 1001000 - if(m_UseExtMemoryBudget || m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - VMA_ASSERT(m_VulkanFunctions.vkGetPhysicalDeviceMemoryProperties2KHR != VMA_NULL); - } -#endif - -#if VMA_VULKAN_VERSION >= 1003000 - if(m_VulkanApiVersion >= VK_MAKE_VERSION(1, 3, 0)) - { - VMA_ASSERT(m_VulkanFunctions.vkGetDeviceBufferMemoryRequirements != VMA_NULL); - 
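The asserts in `ValidateVulkanFunctions()` only require the extension, 1.1, or 1.3 entry points when the corresponding capability was actually requested at allocator creation. A hedged sketch of the create-time flags involved, using names from the VMA 3.x public API (the application still has to enable the matching Vulkan extensions itself):
```
// Illustrative only -- not part of this patch.
VmaAllocatorCreateInfo allocatorInfo = {};
allocatorInfo.vulkanApiVersion = VK_API_VERSION_1_0;     // 1.1+/1.3+ enable several of these implicitly
allocatorInfo.flags =
    VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT |  // needs vkGet*MemoryRequirements2KHR
    VMA_ALLOCATOR_CREATE_KHR_BIND_MEMORY2_BIT |          // needs vkBind*Memory2KHR
    VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT;          // needs vkGetPhysicalDeviceMemoryProperties2KHR
```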
VMA_ASSERT(m_VulkanFunctions.vkGetDeviceImageMemoryRequirements != VMA_NULL); - } -#endif -} - -VkDeviceSize VmaAllocator_T::CalcPreferredBlockSize(uint32_t memTypeIndex) -{ - const uint32_t heapIndex = MemoryTypeIndexToHeapIndex(memTypeIndex); - const VkDeviceSize heapSize = m_MemProps.memoryHeaps[heapIndex].size; - const bool isSmallHeap = heapSize <= VMA_SMALL_HEAP_MAX_SIZE; - return VmaAlignUp(isSmallHeap ? (heapSize / 8) : m_PreferredLargeHeapBlockSize, (VkDeviceSize)32); -} - -VkResult VmaAllocator_T::AllocateMemoryOfType( - VmaPool pool, - VkDeviceSize size, - VkDeviceSize alignment, - bool dedicatedPreferred, - VkBuffer dedicatedBuffer, - VkImage dedicatedImage, - VkFlags dedicatedBufferImageUsage, - const VmaAllocationCreateInfo& createInfo, - uint32_t memTypeIndex, - VmaSuballocationType suballocType, - VmaDedicatedAllocationList& dedicatedAllocations, - VmaBlockVector& blockVector, - size_t allocationCount, - VmaAllocation* pAllocations) -{ - VMA_ASSERT(pAllocations != VMA_NULL); - VMA_DEBUG_LOG(" AllocateMemory: MemoryTypeIndex=%u, AllocationCount=%zu, Size=%llu", memTypeIndex, allocationCount, size); - - VmaAllocationCreateInfo finalCreateInfo = createInfo; - VkResult res = CalcMemTypeParams( - finalCreateInfo, - memTypeIndex, - size, - allocationCount); - if(res != VK_SUCCESS) - return res; - - if((finalCreateInfo.flags & VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT) != 0) - { - return AllocateDedicatedMemory( - pool, - size, - suballocType, - dedicatedAllocations, - memTypeIndex, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_MAPPED_BIT) != 0, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_USER_DATA_COPY_STRING_BIT) != 0, - (finalCreateInfo.flags & - (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != 0, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_CAN_ALIAS_BIT) != 0, - finalCreateInfo.pUserData, - finalCreateInfo.priority, - dedicatedBuffer, - dedicatedImage, - dedicatedBufferImageUsage, - allocationCount, - pAllocations, - blockVector.GetAllocationNextPtr()); - } - else - { - const bool canAllocateDedicated = - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT) == 0 && - (pool == VK_NULL_HANDLE || !blockVector.HasExplicitBlockSize()); - - if(canAllocateDedicated) - { - // Heuristics: Allocate dedicated memory if requested size if greater than half of preferred block size. - if(size > blockVector.GetPreferredBlockSize() / 2) - { - dedicatedPreferred = true; - } - // Protection against creating each allocation as dedicated when we reach or exceed heap size/budget, - // which can quickly deplete maxMemoryAllocationCount: Don't prefer dedicated allocations when above - // 3/4 of the maximum allocation count. 
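            // Illustrative numbers (not from the source): with the common limit
            // maxMemoryAllocationCount = 4096, the check below stops preferring
            // dedicated allocations once more than 3072 VkDeviceMemory objects are
            // live, so further large requests are suballocated from existing blocks
            // rather than consuming the remaining vkAllocateMemory budget.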
- if(m_DeviceMemoryCount.load() > m_PhysicalDeviceProperties.limits.maxMemoryAllocationCount * 3 / 4) - { - dedicatedPreferred = false; - } - - if(dedicatedPreferred) - { - res = AllocateDedicatedMemory( - pool, - size, - suballocType, - dedicatedAllocations, - memTypeIndex, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_MAPPED_BIT) != 0, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_USER_DATA_COPY_STRING_BIT) != 0, - (finalCreateInfo.flags & - (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != 0, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_CAN_ALIAS_BIT) != 0, - finalCreateInfo.pUserData, - finalCreateInfo.priority, - dedicatedBuffer, - dedicatedImage, - dedicatedBufferImageUsage, - allocationCount, - pAllocations, - blockVector.GetAllocationNextPtr()); - if(res == VK_SUCCESS) - { - // Succeeded: AllocateDedicatedMemory function already filld pMemory, nothing more to do here. - VMA_DEBUG_LOG(" Allocated as DedicatedMemory"); - return VK_SUCCESS; - } - } - } - - res = blockVector.Allocate( - size, - alignment, - finalCreateInfo, - suballocType, - allocationCount, - pAllocations); - if(res == VK_SUCCESS) - return VK_SUCCESS; - - // Try dedicated memory. - if(canAllocateDedicated && !dedicatedPreferred) - { - res = AllocateDedicatedMemory( - pool, - size, - suballocType, - dedicatedAllocations, - memTypeIndex, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_MAPPED_BIT) != 0, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_USER_DATA_COPY_STRING_BIT) != 0, - (finalCreateInfo.flags & - (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != 0, - (finalCreateInfo.flags & VMA_ALLOCATION_CREATE_CAN_ALIAS_BIT) != 0, - finalCreateInfo.pUserData, - finalCreateInfo.priority, - dedicatedBuffer, - dedicatedImage, - dedicatedBufferImageUsage, - allocationCount, - pAllocations, - blockVector.GetAllocationNextPtr()); - if(res == VK_SUCCESS) - { - // Succeeded: AllocateDedicatedMemory function already filld pMemory, nothing more to do here. - VMA_DEBUG_LOG(" Allocated as DedicatedMemory"); - return VK_SUCCESS; - } - } - // Everything failed: Return error code. 
- VMA_DEBUG_LOG(" vkAllocateMemory FAILED"); - return res; - } -} - -VkResult VmaAllocator_T::AllocateDedicatedMemory( - VmaPool pool, - VkDeviceSize size, - VmaSuballocationType suballocType, - VmaDedicatedAllocationList& dedicatedAllocations, - uint32_t memTypeIndex, - bool map, - bool isUserDataString, - bool isMappingAllowed, - bool canAliasMemory, - void* pUserData, - float priority, - VkBuffer dedicatedBuffer, - VkImage dedicatedImage, - VkFlags dedicatedBufferImageUsage, - size_t allocationCount, - VmaAllocation* pAllocations, - const void* pNextChain) -{ - VMA_ASSERT(allocationCount > 0 && pAllocations); - - VkMemoryAllocateInfo allocInfo = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO }; - allocInfo.memoryTypeIndex = memTypeIndex; - allocInfo.allocationSize = size; - allocInfo.pNext = pNextChain; - -#if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - VkMemoryDedicatedAllocateInfoKHR dedicatedAllocInfo = { VK_STRUCTURE_TYPE_MEMORY_DEDICATED_ALLOCATE_INFO_KHR }; - if(!canAliasMemory) - { - if(m_UseKhrDedicatedAllocation || m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - if(dedicatedBuffer != VK_NULL_HANDLE) - { - VMA_ASSERT(dedicatedImage == VK_NULL_HANDLE); - dedicatedAllocInfo.buffer = dedicatedBuffer; - VmaPnextChainPushFront(&allocInfo, &dedicatedAllocInfo); - } - else if(dedicatedImage != VK_NULL_HANDLE) - { - dedicatedAllocInfo.image = dedicatedImage; - VmaPnextChainPushFront(&allocInfo, &dedicatedAllocInfo); - } - } - } -#endif // #if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - -#if VMA_BUFFER_DEVICE_ADDRESS - VkMemoryAllocateFlagsInfoKHR allocFlagsInfo = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_FLAGS_INFO_KHR }; - if(m_UseKhrBufferDeviceAddress) - { - bool canContainBufferWithDeviceAddress = true; - if(dedicatedBuffer != VK_NULL_HANDLE) - { - canContainBufferWithDeviceAddress = dedicatedBufferImageUsage == UINT32_MAX || // Usage flags unknown - (dedicatedBufferImageUsage & VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_EXT) != 0; - } - else if(dedicatedImage != VK_NULL_HANDLE) - { - canContainBufferWithDeviceAddress = false; - } - if(canContainBufferWithDeviceAddress) - { - allocFlagsInfo.flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT_KHR; - VmaPnextChainPushFront(&allocInfo, &allocFlagsInfo); - } - } -#endif // #if VMA_BUFFER_DEVICE_ADDRESS - -#if VMA_MEMORY_PRIORITY - VkMemoryPriorityAllocateInfoEXT priorityInfo = { VK_STRUCTURE_TYPE_MEMORY_PRIORITY_ALLOCATE_INFO_EXT }; - if(m_UseExtMemoryPriority) - { - VMA_ASSERT(priority >= 0.f && priority <= 1.f); - priorityInfo.priority = priority; - VmaPnextChainPushFront(&allocInfo, &priorityInfo); - } -#endif // #if VMA_MEMORY_PRIORITY - -#if VMA_EXTERNAL_MEMORY - // Attach VkExportMemoryAllocateInfoKHR if necessary. 
- VkExportMemoryAllocateInfoKHR exportMemoryAllocInfo = { VK_STRUCTURE_TYPE_EXPORT_MEMORY_ALLOCATE_INFO_KHR }; - exportMemoryAllocInfo.handleTypes = GetExternalMemoryHandleTypeFlags(memTypeIndex); - if(exportMemoryAllocInfo.handleTypes != 0) - { - VmaPnextChainPushFront(&allocInfo, &exportMemoryAllocInfo); - } -#endif // #if VMA_EXTERNAL_MEMORY - - size_t allocIndex; - VkResult res = VK_SUCCESS; - for(allocIndex = 0; allocIndex < allocationCount; ++allocIndex) - { - res = AllocateDedicatedMemoryPage( - pool, - size, - suballocType, - memTypeIndex, - allocInfo, - map, - isUserDataString, - isMappingAllowed, - pUserData, - pAllocations + allocIndex); - if(res != VK_SUCCESS) - { - break; - } - } - - if(res == VK_SUCCESS) - { - for (allocIndex = 0; allocIndex < allocationCount; ++allocIndex) - { - dedicatedAllocations.Register(pAllocations[allocIndex]); - } - VMA_DEBUG_LOG(" Allocated DedicatedMemory Count=%zu, MemoryTypeIndex=#%u", allocationCount, memTypeIndex); - } - else - { - // Free all already created allocations. - while(allocIndex--) - { - VmaAllocation currAlloc = pAllocations[allocIndex]; - VkDeviceMemory hMemory = currAlloc->GetMemory(); - - /* - There is no need to call this, because Vulkan spec allows to skip vkUnmapMemory - before vkFreeMemory. - - if(currAlloc->GetMappedData() != VMA_NULL) - { - (*m_VulkanFunctions.vkUnmapMemory)(m_hDevice, hMemory); - } - */ - - FreeVulkanMemory(memTypeIndex, currAlloc->GetSize(), hMemory); - m_Budget.RemoveAllocation(MemoryTypeIndexToHeapIndex(memTypeIndex), currAlloc->GetSize()); - m_AllocationObjectAllocator.Free(currAlloc); - } - - memset(pAllocations, 0, sizeof(VmaAllocation) * allocationCount); - } - - return res; -} - -VkResult VmaAllocator_T::AllocateDedicatedMemoryPage( - VmaPool pool, - VkDeviceSize size, - VmaSuballocationType suballocType, - uint32_t memTypeIndex, - const VkMemoryAllocateInfo& allocInfo, - bool map, - bool isUserDataString, - bool isMappingAllowed, - void* pUserData, - VmaAllocation* pAllocation) -{ - VkDeviceMemory hMemory = VK_NULL_HANDLE; - VkResult res = AllocateVulkanMemory(&allocInfo, &hMemory); - if(res < 0) - { - VMA_DEBUG_LOG(" vkAllocateMemory FAILED"); - return res; - } - - void* pMappedData = VMA_NULL; - if(map) - { - res = (*m_VulkanFunctions.vkMapMemory)( - m_hDevice, - hMemory, - 0, - VK_WHOLE_SIZE, - 0, - &pMappedData); - if(res < 0) - { - VMA_DEBUG_LOG(" vkMapMemory FAILED"); - FreeVulkanMemory(memTypeIndex, size, hMemory); - return res; - } - } - - *pAllocation = m_AllocationObjectAllocator.Allocate(isMappingAllowed); - (*pAllocation)->InitDedicatedAllocation(pool, memTypeIndex, hMemory, suballocType, pMappedData, size); - if (isUserDataString) - (*pAllocation)->SetName(this, (const char*)pUserData); - else - (*pAllocation)->SetUserData(this, pUserData); - m_Budget.AddAllocation(MemoryTypeIndexToHeapIndex(memTypeIndex), size); - if(VMA_DEBUG_INITIALIZE_ALLOCATIONS) - { - FillAllocation(*pAllocation, VMA_ALLOCATION_FILL_PATTERN_CREATED); - } - - return VK_SUCCESS; -} - -void VmaAllocator_T::GetBufferMemoryRequirements( - VkBuffer hBuffer, - VkMemoryRequirements& memReq, - bool& requiresDedicatedAllocation, - bool& prefersDedicatedAllocation) const -{ -#if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - if(m_UseKhrDedicatedAllocation || m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - VkBufferMemoryRequirementsInfo2KHR memReqInfo = { VK_STRUCTURE_TYPE_BUFFER_MEMORY_REQUIREMENTS_INFO_2_KHR }; - memReqInfo.buffer = hBuffer; - - VkMemoryDedicatedRequirementsKHR memDedicatedReq = { 
VK_STRUCTURE_TYPE_MEMORY_DEDICATED_REQUIREMENTS_KHR }; - - VkMemoryRequirements2KHR memReq2 = { VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2_KHR }; - VmaPnextChainPushFront(&memReq2, &memDedicatedReq); - - (*m_VulkanFunctions.vkGetBufferMemoryRequirements2KHR)(m_hDevice, &memReqInfo, &memReq2); - - memReq = memReq2.memoryRequirements; - requiresDedicatedAllocation = (memDedicatedReq.requiresDedicatedAllocation != VK_FALSE); - prefersDedicatedAllocation = (memDedicatedReq.prefersDedicatedAllocation != VK_FALSE); - } - else -#endif // #if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - { - (*m_VulkanFunctions.vkGetBufferMemoryRequirements)(m_hDevice, hBuffer, &memReq); - requiresDedicatedAllocation = false; - prefersDedicatedAllocation = false; - } -} - -void VmaAllocator_T::GetImageMemoryRequirements( - VkImage hImage, - VkMemoryRequirements& memReq, - bool& requiresDedicatedAllocation, - bool& prefersDedicatedAllocation) const -{ -#if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - if(m_UseKhrDedicatedAllocation || m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) - { - VkImageMemoryRequirementsInfo2KHR memReqInfo = { VK_STRUCTURE_TYPE_IMAGE_MEMORY_REQUIREMENTS_INFO_2_KHR }; - memReqInfo.image = hImage; - - VkMemoryDedicatedRequirementsKHR memDedicatedReq = { VK_STRUCTURE_TYPE_MEMORY_DEDICATED_REQUIREMENTS_KHR }; - - VkMemoryRequirements2KHR memReq2 = { VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2_KHR }; - VmaPnextChainPushFront(&memReq2, &memDedicatedReq); - - (*m_VulkanFunctions.vkGetImageMemoryRequirements2KHR)(m_hDevice, &memReqInfo, &memReq2); - - memReq = memReq2.memoryRequirements; - requiresDedicatedAllocation = (memDedicatedReq.requiresDedicatedAllocation != VK_FALSE); - prefersDedicatedAllocation = (memDedicatedReq.prefersDedicatedAllocation != VK_FALSE); - } - else -#endif // #if VMA_DEDICATED_ALLOCATION || VMA_VULKAN_VERSION >= 1001000 - { - (*m_VulkanFunctions.vkGetImageMemoryRequirements)(m_hDevice, hImage, &memReq); - requiresDedicatedAllocation = false; - prefersDedicatedAllocation = false; - } -} - -VkResult VmaAllocator_T::FindMemoryTypeIndex( - uint32_t memoryTypeBits, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - VkFlags bufImgUsage, - uint32_t* pMemoryTypeIndex) const -{ - memoryTypeBits &= GetGlobalMemoryTypeBits(); - - if(pAllocationCreateInfo->memoryTypeBits != 0) - { - memoryTypeBits &= pAllocationCreateInfo->memoryTypeBits; - } - - VkMemoryPropertyFlags requiredFlags = 0, preferredFlags = 0, notPreferredFlags = 0; - if(!FindMemoryPreferences( - IsIntegratedGpu(), - *pAllocationCreateInfo, - bufImgUsage, - requiredFlags, preferredFlags, notPreferredFlags)) - { - return VK_ERROR_FEATURE_NOT_PRESENT; - } - - *pMemoryTypeIndex = UINT32_MAX; - uint32_t minCost = UINT32_MAX; - for(uint32_t memTypeIndex = 0, memTypeBit = 1; - memTypeIndex < GetMemoryTypeCount(); - ++memTypeIndex, memTypeBit <<= 1) - { - // This memory type is acceptable according to memoryTypeBits bitmask. - if((memTypeBit & memoryTypeBits) != 0) - { - const VkMemoryPropertyFlags currFlags = - m_MemProps.memoryTypes[memTypeIndex].propertyFlags; - // This memory type contains requiredFlags. - if((requiredFlags & ~currFlags) == 0) - { - // Calculate cost as number of bits from preferredFlags not present in this memory type. - uint32_t currCost = VMA_COUNT_BITS_SET(preferredFlags & ~currFlags) + - VMA_COUNT_BITS_SET(currFlags & notPreferredFlags); - // Remember memory type with lowest cost. 
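                // Worked example (flag values hypothetical): with requiredFlags = HOST_VISIBLE,
                // preferredFlags = HOST_COHERENT | HOST_CACHED and notPreferredFlags = DEVICE_LOCAL,
                // a HOST_VISIBLE|HOST_COHERENT type scores 1 (HOST_CACHED missing), while a
                // HOST_VISIBLE|HOST_COHERENT|DEVICE_LOCAL type scores 1 + 1 = 2, so the former is
                // kept below unless some type scores 0, which short-circuits with VK_SUCCESS.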
- if(currCost < minCost) - { - *pMemoryTypeIndex = memTypeIndex; - if(currCost == 0) - { - return VK_SUCCESS; - } - minCost = currCost; - } - } - } - } - return (*pMemoryTypeIndex != UINT32_MAX) ? VK_SUCCESS : VK_ERROR_FEATURE_NOT_PRESENT; -} - -VkResult VmaAllocator_T::CalcMemTypeParams( - VmaAllocationCreateInfo& inoutCreateInfo, - uint32_t memTypeIndex, - VkDeviceSize size, - size_t allocationCount) -{ - // If memory type is not HOST_VISIBLE, disable MAPPED. - if((inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_MAPPED_BIT) != 0 && - (m_MemProps.memoryTypes[memTypeIndex].propertyFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) == 0) - { - inoutCreateInfo.flags &= ~VMA_ALLOCATION_CREATE_MAPPED_BIT; - } - - if((inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT) != 0 && - (inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT) != 0) - { - const uint32_t heapIndex = MemoryTypeIndexToHeapIndex(memTypeIndex); - VmaBudget heapBudget = {}; - GetHeapBudgets(&heapBudget, heapIndex, 1); - if(heapBudget.usage + size * allocationCount > heapBudget.budget) - { - return VK_ERROR_OUT_OF_DEVICE_MEMORY; - } - } - return VK_SUCCESS; -} - -VkResult VmaAllocator_T::CalcAllocationParams( - VmaAllocationCreateInfo& inoutCreateInfo, - bool dedicatedRequired, - bool dedicatedPreferred) -{ - VMA_ASSERT((inoutCreateInfo.flags & - (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != - (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT) && - "Specifying both flags VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT and VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT is incorrect."); - VMA_ASSERT((((inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_HOST_ACCESS_ALLOW_TRANSFER_INSTEAD_BIT) == 0 || - (inoutCreateInfo.flags & (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != 0)) && - "Specifying VMA_ALLOCATION_CREATE_HOST_ACCESS_ALLOW_TRANSFER_INSTEAD_BIT requires also VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT."); - if(inoutCreateInfo.usage == VMA_MEMORY_USAGE_AUTO || inoutCreateInfo.usage == VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE || inoutCreateInfo.usage == VMA_MEMORY_USAGE_AUTO_PREFER_HOST) - { - if((inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_MAPPED_BIT) != 0) - { - VMA_ASSERT((inoutCreateInfo.flags & (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) != 0 && - "When using VMA_ALLOCATION_CREATE_MAPPED_BIT and usage = VMA_MEMORY_USAGE_AUTO*, you must also specify VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT."); - } - } - - // If memory is lazily allocated, it should be always dedicated. 
- if(dedicatedRequired || - inoutCreateInfo.usage == VMA_MEMORY_USAGE_GPU_LAZILY_ALLOCATED) - { - inoutCreateInfo.flags |= VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT; - } - - if(inoutCreateInfo.pool != VK_NULL_HANDLE) - { - if(inoutCreateInfo.pool->m_BlockVector.HasExplicitBlockSize() && - (inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT) != 0) - { - VMA_ASSERT(0 && "Specifying VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT while current custom pool doesn't support dedicated allocations."); - return VK_ERROR_FEATURE_NOT_PRESENT; - } - inoutCreateInfo.priority = inoutCreateInfo.pool->m_BlockVector.GetPriority(); - } - - if((inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT) != 0 && - (inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT) != 0) - { - VMA_ASSERT(0 && "Specifying VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT together with VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT makes no sense."); - return VK_ERROR_FEATURE_NOT_PRESENT; - } - - if(VMA_DEBUG_ALWAYS_DEDICATED_MEMORY && - (inoutCreateInfo.flags & VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT) != 0) - { - inoutCreateInfo.flags |= VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT; - } - - // Non-auto USAGE values imply HOST_ACCESS flags. - // And so does VMA_MEMORY_USAGE_UNKNOWN because it is used with custom pools. - // Which specific flag is used doesn't matter. They change things only when used with VMA_MEMORY_USAGE_AUTO*. - // Otherwise they just protect from assert on mapping. - if(inoutCreateInfo.usage != VMA_MEMORY_USAGE_AUTO && - inoutCreateInfo.usage != VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE && - inoutCreateInfo.usage != VMA_MEMORY_USAGE_AUTO_PREFER_HOST) - { - if((inoutCreateInfo.flags & (VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT)) == 0) - { - inoutCreateInfo.flags |= VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT; - } - } - - return VK_SUCCESS; -} - -VkResult VmaAllocator_T::AllocateMemory( - const VkMemoryRequirements& vkMemReq, - bool requiresDedicatedAllocation, - bool prefersDedicatedAllocation, - VkBuffer dedicatedBuffer, - VkImage dedicatedImage, - VkFlags dedicatedBufferImageUsage, - const VmaAllocationCreateInfo& createInfo, - VmaSuballocationType suballocType, - size_t allocationCount, - VmaAllocation* pAllocations) -{ - memset(pAllocations, 0, sizeof(VmaAllocation) * allocationCount); - - VMA_ASSERT(VmaIsPow2(vkMemReq.alignment)); - - if(vkMemReq.size == 0) - { - return VK_ERROR_INITIALIZATION_FAILED; - } - - VmaAllocationCreateInfo createInfoFinal = createInfo; - VkResult res = CalcAllocationParams(createInfoFinal, requiresDedicatedAllocation, prefersDedicatedAllocation); - if(res != VK_SUCCESS) - return res; - - if(createInfoFinal.pool != VK_NULL_HANDLE) - { - VmaBlockVector& blockVector = createInfoFinal.pool->m_BlockVector; - return AllocateMemoryOfType( - createInfoFinal.pool, - vkMemReq.size, - vkMemReq.alignment, - prefersDedicatedAllocation, - dedicatedBuffer, - dedicatedImage, - dedicatedBufferImageUsage, - createInfoFinal, - blockVector.GetMemoryTypeIndex(), - suballocType, - createInfoFinal.pool->m_DedicatedAllocations, - blockVector, - allocationCount, - pAllocations); - } - else - { - // Bit mask of memory Vulkan types acceptable for this allocation. - uint32_t memoryTypeBits = vkMemReq.memoryTypeBits; - uint32_t memTypeIndex = UINT32_MAX; - res = FindMemoryTypeIndex(memoryTypeBits, &createInfoFinal, dedicatedBufferImageUsage, &memTypeIndex); - // Can't find any single memory type matching requirements. 
res is VK_ERROR_FEATURE_NOT_PRESENT. - if(res != VK_SUCCESS) - return res; - do - { - VmaBlockVector* blockVector = m_pBlockVectors[memTypeIndex]; - VMA_ASSERT(blockVector && "Trying to use unsupported memory type!"); - res = AllocateMemoryOfType( - VK_NULL_HANDLE, - vkMemReq.size, - vkMemReq.alignment, - requiresDedicatedAllocation || prefersDedicatedAllocation, - dedicatedBuffer, - dedicatedImage, - dedicatedBufferImageUsage, - createInfoFinal, - memTypeIndex, - suballocType, - m_DedicatedAllocations[memTypeIndex], - *blockVector, - allocationCount, - pAllocations); - // Allocation succeeded - if(res == VK_SUCCESS) - return VK_SUCCESS; - - // Remove old memTypeIndex from list of possibilities. - memoryTypeBits &= ~(1u << memTypeIndex); - // Find alternative memTypeIndex. - res = FindMemoryTypeIndex(memoryTypeBits, &createInfoFinal, dedicatedBufferImageUsage, &memTypeIndex); - } while(res == VK_SUCCESS); - - // No other matching memory type index could be found. - // Not returning res, which is VK_ERROR_FEATURE_NOT_PRESENT, because we already failed to allocate once. - return VK_ERROR_OUT_OF_DEVICE_MEMORY; - } -} - -void VmaAllocator_T::FreeMemory( - size_t allocationCount, - const VmaAllocation* pAllocations) -{ - VMA_ASSERT(pAllocations); - - for(size_t allocIndex = allocationCount; allocIndex--; ) - { - VmaAllocation allocation = pAllocations[allocIndex]; - - if(allocation != VK_NULL_HANDLE) - { - if(VMA_DEBUG_INITIALIZE_ALLOCATIONS) - { - FillAllocation(allocation, VMA_ALLOCATION_FILL_PATTERN_DESTROYED); - } - - allocation->FreeName(this); - - switch(allocation->GetType()) - { - case VmaAllocation_T::ALLOCATION_TYPE_BLOCK: - { - VmaBlockVector* pBlockVector = VMA_NULL; - VmaPool hPool = allocation->GetParentPool(); - if(hPool != VK_NULL_HANDLE) - { - pBlockVector = &hPool->m_BlockVector; - } - else - { - const uint32_t memTypeIndex = allocation->GetMemoryTypeIndex(); - pBlockVector = m_pBlockVectors[memTypeIndex]; - VMA_ASSERT(pBlockVector && "Trying to free memory of unsupported type!"); - } - pBlockVector->Free(allocation); - } - break; - case VmaAllocation_T::ALLOCATION_TYPE_DEDICATED: - FreeDedicatedMemory(allocation); - break; - default: - VMA_ASSERT(0); - } - } - } -} - -void VmaAllocator_T::CalculateStatistics(VmaTotalStatistics* pStats) -{ - // Initialize. - VmaClearDetailedStatistics(pStats->total); - for(uint32_t i = 0; i < VK_MAX_MEMORY_TYPES; ++i) - VmaClearDetailedStatistics(pStats->memoryType[i]); - for(uint32_t i = 0; i < VK_MAX_MEMORY_HEAPS; ++i) - VmaClearDetailedStatistics(pStats->memoryHeap[i]); - - // Process default pools. - for(uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - VmaBlockVector* const pBlockVector = m_pBlockVectors[memTypeIndex]; - if (pBlockVector != VMA_NULL) - pBlockVector->AddDetailedStatistics(pStats->memoryType[memTypeIndex]); - } - - // Process custom pools. - { - VmaMutexLockRead lock(m_PoolsMutex, m_UseMutex); - for(VmaPool pool = m_Pools.Front(); pool != VMA_NULL; pool = m_Pools.GetNext(pool)) - { - VmaBlockVector& blockVector = pool->m_BlockVector; - const uint32_t memTypeIndex = blockVector.GetMemoryTypeIndex(); - blockVector.AddDetailedStatistics(pStats->memoryType[memTypeIndex]); - pool->m_DedicatedAllocations.AddDetailedStatistics(pStats->memoryType[memTypeIndex]); - } - } - - // Process dedicated allocations. 
- for(uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - m_DedicatedAllocations[memTypeIndex].AddDetailedStatistics(pStats->memoryType[memTypeIndex]); - } - - // Sum from memory types to memory heaps. - for(uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - const uint32_t memHeapIndex = m_MemProps.memoryTypes[memTypeIndex].heapIndex; - VmaAddDetailedStatistics(pStats->memoryHeap[memHeapIndex], pStats->memoryType[memTypeIndex]); - } - - // Sum from memory heaps to total. - for(uint32_t memHeapIndex = 0; memHeapIndex < GetMemoryHeapCount(); ++memHeapIndex) - VmaAddDetailedStatistics(pStats->total, pStats->memoryHeap[memHeapIndex]); - - VMA_ASSERT(pStats->total.statistics.allocationCount == 0 || - pStats->total.allocationSizeMax >= pStats->total.allocationSizeMin); - VMA_ASSERT(pStats->total.unusedRangeCount == 0 || - pStats->total.unusedRangeSizeMax >= pStats->total.unusedRangeSizeMin); -} - -void VmaAllocator_T::GetHeapBudgets(VmaBudget* outBudgets, uint32_t firstHeap, uint32_t heapCount) -{ -#if VMA_MEMORY_BUDGET - if(m_UseExtMemoryBudget) - { - if(m_Budget.m_OperationsSinceBudgetFetch < 30) - { - VmaMutexLockRead lockRead(m_Budget.m_BudgetMutex, m_UseMutex); - for(uint32_t i = 0; i < heapCount; ++i, ++outBudgets) - { - const uint32_t heapIndex = firstHeap + i; - - outBudgets->statistics.blockCount = m_Budget.m_BlockCount[heapIndex]; - outBudgets->statistics.allocationCount = m_Budget.m_AllocationCount[heapIndex]; - outBudgets->statistics.blockBytes = m_Budget.m_BlockBytes[heapIndex]; - outBudgets->statistics.allocationBytes = m_Budget.m_AllocationBytes[heapIndex]; - - if(m_Budget.m_VulkanUsage[heapIndex] + outBudgets->statistics.blockBytes > m_Budget.m_BlockBytesAtBudgetFetch[heapIndex]) - { - outBudgets->usage = m_Budget.m_VulkanUsage[heapIndex] + - outBudgets->statistics.blockBytes - m_Budget.m_BlockBytesAtBudgetFetch[heapIndex]; - } - else - { - outBudgets->usage = 0; - } - - // Have to take MIN with heap size because explicit HeapSizeLimit is included in it. - outBudgets->budget = VMA_MIN( - m_Budget.m_VulkanBudget[heapIndex], m_MemProps.memoryHeaps[heapIndex].size); - } - } - else - { - UpdateVulkanBudget(); // Outside of mutex lock - GetHeapBudgets(outBudgets, firstHeap, heapCount); // Recursion - } - } - else -#endif - { - for(uint32_t i = 0; i < heapCount; ++i, ++outBudgets) - { - const uint32_t heapIndex = firstHeap + i; - - outBudgets->statistics.blockCount = m_Budget.m_BlockCount[heapIndex]; - outBudgets->statistics.allocationCount = m_Budget.m_AllocationCount[heapIndex]; - outBudgets->statistics.blockBytes = m_Budget.m_BlockBytes[heapIndex]; - outBudgets->statistics.allocationBytes = m_Budget.m_AllocationBytes[heapIndex]; - - outBudgets->usage = outBudgets->statistics.blockBytes; - outBudgets->budget = m_MemProps.memoryHeaps[heapIndex].size * 8 / 10; // 80% heuristics. 
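On the application side, the numbers filled in by `GetHeapBudgets()` above surface through `vmaGetHeapBudgets()`. A minimal sketch against the VMA 3.x public API; when `VK_EXT_memory_budget` is not enabled, the budget falls back to the 80%-of-heap heuristic shown above:
```
// Illustrative only -- not part of this patch. Assumes vk_mem_alloc.h is included.
#include <cstdio>

void LogHeapBudgets(VmaAllocator allocator, const VkPhysicalDeviceMemoryProperties& memProps, uint32_t frameIndex)
{
    // Advancing the frame index also refreshes the cached VK_EXT_memory_budget numbers.
    vmaSetCurrentFrameIndex(allocator, frameIndex);

    VmaBudget budgets[VK_MAX_MEMORY_HEAPS] = {};
    vmaGetHeapBudgets(allocator, budgets);

    for (uint32_t heapIndex = 0; heapIndex < memProps.memoryHeapCount; ++heapIndex)
    {
        std::printf("heap %u: usage %llu / budget %llu (VMA block bytes: %llu)\n",
            heapIndex,
            (unsigned long long)budgets[heapIndex].usage,
            (unsigned long long)budgets[heapIndex].budget,
            (unsigned long long)budgets[heapIndex].statistics.blockBytes);
    }
}
```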
- } - } -} - -void VmaAllocator_T::GetAllocationInfo(VmaAllocation hAllocation, VmaAllocationInfo* pAllocationInfo) -{ - pAllocationInfo->memoryType = hAllocation->GetMemoryTypeIndex(); - pAllocationInfo->deviceMemory = hAllocation->GetMemory(); - pAllocationInfo->offset = hAllocation->GetOffset(); - pAllocationInfo->size = hAllocation->GetSize(); - pAllocationInfo->pMappedData = hAllocation->GetMappedData(); - pAllocationInfo->pUserData = hAllocation->GetUserData(); - pAllocationInfo->pName = hAllocation->GetName(); -} - -VkResult VmaAllocator_T::CreatePool(const VmaPoolCreateInfo* pCreateInfo, VmaPool* pPool) -{ - VMA_DEBUG_LOG(" CreatePool: MemoryTypeIndex=%u, flags=%u", pCreateInfo->memoryTypeIndex, pCreateInfo->flags); - - VmaPoolCreateInfo newCreateInfo = *pCreateInfo; - - // Protection against uninitialized new structure member. If garbage data are left there, this pointer dereference would crash. - if(pCreateInfo->pMemoryAllocateNext) - { - VMA_ASSERT(((const VkBaseInStructure*)pCreateInfo->pMemoryAllocateNext)->sType != 0); - } - - if(newCreateInfo.maxBlockCount == 0) - { - newCreateInfo.maxBlockCount = SIZE_MAX; - } - if(newCreateInfo.minBlockCount > newCreateInfo.maxBlockCount) - { - return VK_ERROR_INITIALIZATION_FAILED; - } - // Memory type index out of range or forbidden. - if(pCreateInfo->memoryTypeIndex >= GetMemoryTypeCount() || - ((1u << pCreateInfo->memoryTypeIndex) & m_GlobalMemoryTypeBits) == 0) - { - return VK_ERROR_FEATURE_NOT_PRESENT; - } - if(newCreateInfo.minAllocationAlignment > 0) - { - VMA_ASSERT(VmaIsPow2(newCreateInfo.minAllocationAlignment)); - } - - const VkDeviceSize preferredBlockSize = CalcPreferredBlockSize(newCreateInfo.memoryTypeIndex); - - *pPool = vma_new(this, VmaPool_T)(this, newCreateInfo, preferredBlockSize); - - VkResult res = (*pPool)->m_BlockVector.CreateMinBlocks(); - if(res != VK_SUCCESS) - { - vma_delete(this, *pPool); - *pPool = VMA_NULL; - return res; - } - - // Add to m_Pools. - { - VmaMutexLockWrite lock(m_PoolsMutex, m_UseMutex); - (*pPool)->SetId(m_NextPoolId++); - m_Pools.PushBack(*pPool); - } - - return VK_SUCCESS; -} - -void VmaAllocator_T::DestroyPool(VmaPool pool) -{ - // Remove from m_Pools. - { - VmaMutexLockWrite lock(m_PoolsMutex, m_UseMutex); - m_Pools.Remove(pool); - } - - vma_delete(this, pool); -} - -void VmaAllocator_T::GetPoolStatistics(VmaPool pool, VmaStatistics* pPoolStats) -{ - VmaClearStatistics(*pPoolStats); - pool->m_BlockVector.AddStatistics(*pPoolStats); - pool->m_DedicatedAllocations.AddStatistics(*pPoolStats); -} - -void VmaAllocator_T::CalculatePoolStatistics(VmaPool pool, VmaDetailedStatistics* pPoolStats) -{ - VmaClearDetailedStatistics(*pPoolStats); - pool->m_BlockVector.AddDetailedStatistics(*pPoolStats); - pool->m_DedicatedAllocations.AddDetailedStatistics(*pPoolStats); -} - -void VmaAllocator_T::SetCurrentFrameIndex(uint32_t frameIndex) -{ - m_CurrentFrameIndex.store(frameIndex); - -#if VMA_MEMORY_BUDGET - if(m_UseExtMemoryBudget) - { - UpdateVulkanBudget(); - } -#endif // #if VMA_MEMORY_BUDGET -} - -VkResult VmaAllocator_T::CheckPoolCorruption(VmaPool hPool) -{ - return hPool->m_BlockVector.CheckCorruption(); -} - -VkResult VmaAllocator_T::CheckCorruption(uint32_t memoryTypeBits) -{ - VkResult finalRes = VK_ERROR_FEATURE_NOT_PRESENT; - - // Process default pools. 
- for(uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - VmaBlockVector* const pBlockVector = m_pBlockVectors[memTypeIndex]; - if(pBlockVector != VMA_NULL) - { - VkResult localRes = pBlockVector->CheckCorruption(); - switch(localRes) - { - case VK_ERROR_FEATURE_NOT_PRESENT: - break; - case VK_SUCCESS: - finalRes = VK_SUCCESS; - break; - default: - return localRes; - } - } - } - - // Process custom pools. - { - VmaMutexLockRead lock(m_PoolsMutex, m_UseMutex); - for(VmaPool pool = m_Pools.Front(); pool != VMA_NULL; pool = m_Pools.GetNext(pool)) - { - if(((1u << pool->m_BlockVector.GetMemoryTypeIndex()) & memoryTypeBits) != 0) - { - VkResult localRes = pool->m_BlockVector.CheckCorruption(); - switch(localRes) - { - case VK_ERROR_FEATURE_NOT_PRESENT: - break; - case VK_SUCCESS: - finalRes = VK_SUCCESS; - break; - default: - return localRes; - } - } - } - } - - return finalRes; -} - -VkResult VmaAllocator_T::AllocateVulkanMemory(const VkMemoryAllocateInfo* pAllocateInfo, VkDeviceMemory* pMemory) -{ - AtomicTransactionalIncrement deviceMemoryCountIncrement; - const uint64_t prevDeviceMemoryCount = deviceMemoryCountIncrement.Increment(&m_DeviceMemoryCount); -#if VMA_DEBUG_DONT_EXCEED_MAX_MEMORY_ALLOCATION_COUNT - if(prevDeviceMemoryCount >= m_PhysicalDeviceProperties.limits.maxMemoryAllocationCount) - { - return VK_ERROR_TOO_MANY_OBJECTS; - } -#endif - - const uint32_t heapIndex = MemoryTypeIndexToHeapIndex(pAllocateInfo->memoryTypeIndex); - - // HeapSizeLimit is in effect for this heap. - if((m_HeapSizeLimitMask & (1u << heapIndex)) != 0) - { - const VkDeviceSize heapSize = m_MemProps.memoryHeaps[heapIndex].size; - VkDeviceSize blockBytes = m_Budget.m_BlockBytes[heapIndex]; - for(;;) - { - const VkDeviceSize blockBytesAfterAllocation = blockBytes + pAllocateInfo->allocationSize; - if(blockBytesAfterAllocation > heapSize) - { - return VK_ERROR_OUT_OF_DEVICE_MEMORY; - } - if(m_Budget.m_BlockBytes[heapIndex].compare_exchange_strong(blockBytes, blockBytesAfterAllocation)) - { - break; - } - } - } - else - { - m_Budget.m_BlockBytes[heapIndex] += pAllocateInfo->allocationSize; - } - ++m_Budget.m_BlockCount[heapIndex]; - - // VULKAN CALL vkAllocateMemory. - VkResult res = (*m_VulkanFunctions.vkAllocateMemory)(m_hDevice, pAllocateInfo, GetAllocationCallbacks(), pMemory); - - if(res == VK_SUCCESS) - { -#if VMA_MEMORY_BUDGET - ++m_Budget.m_OperationsSinceBudgetFetch; -#endif - - // Informative callback. - if(m_DeviceMemoryCallbacks.pfnAllocate != VMA_NULL) - { - (*m_DeviceMemoryCallbacks.pfnAllocate)(this, pAllocateInfo->memoryTypeIndex, *pMemory, pAllocateInfo->allocationSize, m_DeviceMemoryCallbacks.pUserData); - } - - deviceMemoryCountIncrement.Commit(); - } - else - { - --m_Budget.m_BlockCount[heapIndex]; - m_Budget.m_BlockBytes[heapIndex] -= pAllocateInfo->allocationSize; - } - - return res; -} - -void VmaAllocator_T::FreeVulkanMemory(uint32_t memoryType, VkDeviceSize size, VkDeviceMemory hMemory) -{ - // Informative callback. - if(m_DeviceMemoryCallbacks.pfnFree != VMA_NULL) - { - (*m_DeviceMemoryCallbacks.pfnFree)(this, memoryType, hMemory, size, m_DeviceMemoryCallbacks.pUserData); - } - - // VULKAN CALL vkFreeMemory. 
- (*m_VulkanFunctions.vkFreeMemory)(m_hDevice, hMemory, GetAllocationCallbacks()); - - const uint32_t heapIndex = MemoryTypeIndexToHeapIndex(memoryType); - --m_Budget.m_BlockCount[heapIndex]; - m_Budget.m_BlockBytes[heapIndex] -= size; - - --m_DeviceMemoryCount; -} - -VkResult VmaAllocator_T::BindVulkanBuffer( - VkDeviceMemory memory, - VkDeviceSize memoryOffset, - VkBuffer buffer, - const void* pNext) -{ - if(pNext != VMA_NULL) - { -#if VMA_VULKAN_VERSION >= 1001000 || VMA_BIND_MEMORY2 - if((m_UseKhrBindMemory2 || m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) && - m_VulkanFunctions.vkBindBufferMemory2KHR != VMA_NULL) - { - VkBindBufferMemoryInfoKHR bindBufferMemoryInfo = { VK_STRUCTURE_TYPE_BIND_BUFFER_MEMORY_INFO_KHR }; - bindBufferMemoryInfo.pNext = pNext; - bindBufferMemoryInfo.buffer = buffer; - bindBufferMemoryInfo.memory = memory; - bindBufferMemoryInfo.memoryOffset = memoryOffset; - return (*m_VulkanFunctions.vkBindBufferMemory2KHR)(m_hDevice, 1, &bindBufferMemoryInfo); - } - else -#endif // #if VMA_VULKAN_VERSION >= 1001000 || VMA_BIND_MEMORY2 - { - return VK_ERROR_EXTENSION_NOT_PRESENT; - } - } - else - { - return (*m_VulkanFunctions.vkBindBufferMemory)(m_hDevice, buffer, memory, memoryOffset); - } -} - -VkResult VmaAllocator_T::BindVulkanImage( - VkDeviceMemory memory, - VkDeviceSize memoryOffset, - VkImage image, - const void* pNext) -{ - if(pNext != VMA_NULL) - { -#if VMA_VULKAN_VERSION >= 1001000 || VMA_BIND_MEMORY2 - if((m_UseKhrBindMemory2 || m_VulkanApiVersion >= VK_MAKE_VERSION(1, 1, 0)) && - m_VulkanFunctions.vkBindImageMemory2KHR != VMA_NULL) - { - VkBindImageMemoryInfoKHR bindBufferMemoryInfo = { VK_STRUCTURE_TYPE_BIND_IMAGE_MEMORY_INFO_KHR }; - bindBufferMemoryInfo.pNext = pNext; - bindBufferMemoryInfo.image = image; - bindBufferMemoryInfo.memory = memory; - bindBufferMemoryInfo.memoryOffset = memoryOffset; - return (*m_VulkanFunctions.vkBindImageMemory2KHR)(m_hDevice, 1, &bindBufferMemoryInfo); - } - else -#endif // #if VMA_BIND_MEMORY2 - { - return VK_ERROR_EXTENSION_NOT_PRESENT; - } - } - else - { - return (*m_VulkanFunctions.vkBindImageMemory)(m_hDevice, image, memory, memoryOffset); - } -} - -VkResult VmaAllocator_T::Map(VmaAllocation hAllocation, void** ppData) -{ - switch(hAllocation->GetType()) - { - case VmaAllocation_T::ALLOCATION_TYPE_BLOCK: - { - VmaDeviceMemoryBlock* const pBlock = hAllocation->GetBlock(); - char *pBytes = VMA_NULL; - VkResult res = pBlock->Map(this, 1, (void**)&pBytes); - if(res == VK_SUCCESS) - { - *ppData = pBytes + (ptrdiff_t)hAllocation->GetOffset(); - hAllocation->BlockAllocMap(); - } - return res; - } - case VmaAllocation_T::ALLOCATION_TYPE_DEDICATED: - return hAllocation->DedicatedAllocMap(this, ppData); - default: - VMA_ASSERT(0); - return VK_ERROR_MEMORY_MAP_FAILED; - } -} - -void VmaAllocator_T::Unmap(VmaAllocation hAllocation) -{ - switch(hAllocation->GetType()) - { - case VmaAllocation_T::ALLOCATION_TYPE_BLOCK: - { - VmaDeviceMemoryBlock* const pBlock = hAllocation->GetBlock(); - hAllocation->BlockAllocUnmap(); - pBlock->Unmap(this, 1); - } - break; - case VmaAllocation_T::ALLOCATION_TYPE_DEDICATED: - hAllocation->DedicatedAllocUnmap(this); - break; - default: - VMA_ASSERT(0); - } -} - -VkResult VmaAllocator_T::BindBufferMemory( - VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkBuffer hBuffer, - const void* pNext) -{ - VkResult res = VK_SUCCESS; - switch(hAllocation->GetType()) - { - case VmaAllocation_T::ALLOCATION_TYPE_DEDICATED: - res = BindVulkanBuffer(hAllocation->GetMemory(), 
allocationLocalOffset, hBuffer, pNext); - break; - case VmaAllocation_T::ALLOCATION_TYPE_BLOCK: - { - VmaDeviceMemoryBlock* const pBlock = hAllocation->GetBlock(); - VMA_ASSERT(pBlock && "Binding buffer to allocation that doesn't belong to any block."); - res = pBlock->BindBufferMemory(this, hAllocation, allocationLocalOffset, hBuffer, pNext); - break; - } - default: - VMA_ASSERT(0); - } - return res; -} - -VkResult VmaAllocator_T::BindImageMemory( - VmaAllocation hAllocation, - VkDeviceSize allocationLocalOffset, - VkImage hImage, - const void* pNext) -{ - VkResult res = VK_SUCCESS; - switch(hAllocation->GetType()) - { - case VmaAllocation_T::ALLOCATION_TYPE_DEDICATED: - res = BindVulkanImage(hAllocation->GetMemory(), allocationLocalOffset, hImage, pNext); - break; - case VmaAllocation_T::ALLOCATION_TYPE_BLOCK: - { - VmaDeviceMemoryBlock* pBlock = hAllocation->GetBlock(); - VMA_ASSERT(pBlock && "Binding image to allocation that doesn't belong to any block."); - res = pBlock->BindImageMemory(this, hAllocation, allocationLocalOffset, hImage, pNext); - break; - } - default: - VMA_ASSERT(0); - } - return res; -} - -VkResult VmaAllocator_T::FlushOrInvalidateAllocation( - VmaAllocation hAllocation, - VkDeviceSize offset, VkDeviceSize size, - VMA_CACHE_OPERATION op) -{ - VkResult res = VK_SUCCESS; - - VkMappedMemoryRange memRange = {}; - if(GetFlushOrInvalidateRange(hAllocation, offset, size, memRange)) - { - switch(op) - { - case VMA_CACHE_FLUSH: - res = (*GetVulkanFunctions().vkFlushMappedMemoryRanges)(m_hDevice, 1, &memRange); - break; - case VMA_CACHE_INVALIDATE: - res = (*GetVulkanFunctions().vkInvalidateMappedMemoryRanges)(m_hDevice, 1, &memRange); - break; - default: - VMA_ASSERT(0); - } - } - // else: Just ignore this call. - return res; -} - -VkResult VmaAllocator_T::FlushOrInvalidateAllocations( - uint32_t allocationCount, - const VmaAllocation* allocations, - const VkDeviceSize* offsets, const VkDeviceSize* sizes, - VMA_CACHE_OPERATION op) -{ - typedef VmaStlAllocator RangeAllocator; - typedef VmaSmallVector RangeVector; - RangeVector ranges = RangeVector(RangeAllocator(GetAllocationCallbacks())); - - for(uint32_t allocIndex = 0; allocIndex < allocationCount; ++allocIndex) - { - const VmaAllocation alloc = allocations[allocIndex]; - const VkDeviceSize offset = offsets != VMA_NULL ? offsets[allocIndex] : 0; - const VkDeviceSize size = sizes != VMA_NULL ? sizes[allocIndex] : VK_WHOLE_SIZE; - VkMappedMemoryRange newRange; - if(GetFlushOrInvalidateRange(alloc, offset, size, newRange)) - { - ranges.push_back(newRange); - } - } - - VkResult res = VK_SUCCESS; - if(!ranges.empty()) - { - switch(op) - { - case VMA_CACHE_FLUSH: - res = (*GetVulkanFunctions().vkFlushMappedMemoryRanges)(m_hDevice, (uint32_t)ranges.size(), ranges.data()); - break; - case VMA_CACHE_INVALIDATE: - res = (*GetVulkanFunctions().vkInvalidateMappedMemoryRanges)(m_hDevice, (uint32_t)ranges.size(), ranges.data()); - break; - default: - VMA_ASSERT(0); - } - } - // else: Just ignore this call. 
- return res; -} - -void VmaAllocator_T::FreeDedicatedMemory(const VmaAllocation allocation) -{ - VMA_ASSERT(allocation && allocation->GetType() == VmaAllocation_T::ALLOCATION_TYPE_DEDICATED); - - const uint32_t memTypeIndex = allocation->GetMemoryTypeIndex(); - VmaPool parentPool = allocation->GetParentPool(); - if(parentPool == VK_NULL_HANDLE) - { - // Default pool - m_DedicatedAllocations[memTypeIndex].Unregister(allocation); - } - else - { - // Custom pool - parentPool->m_DedicatedAllocations.Unregister(allocation); - } - - VkDeviceMemory hMemory = allocation->GetMemory(); - - /* - There is no need to call this, because Vulkan spec allows to skip vkUnmapMemory - before vkFreeMemory. - - if(allocation->GetMappedData() != VMA_NULL) - { - (*m_VulkanFunctions.vkUnmapMemory)(m_hDevice, hMemory); - } - */ - - FreeVulkanMemory(memTypeIndex, allocation->GetSize(), hMemory); - - m_Budget.RemoveAllocation(MemoryTypeIndexToHeapIndex(allocation->GetMemoryTypeIndex()), allocation->GetSize()); - m_AllocationObjectAllocator.Free(allocation); - - VMA_DEBUG_LOG(" Freed DedicatedMemory MemoryTypeIndex=%u", memTypeIndex); -} - -uint32_t VmaAllocator_T::CalculateGpuDefragmentationMemoryTypeBits() const -{ - VkBufferCreateInfo dummyBufCreateInfo; - VmaFillGpuDefragmentationBufferCreateInfo(dummyBufCreateInfo); - - uint32_t memoryTypeBits = 0; - - // Create buffer. - VkBuffer buf = VK_NULL_HANDLE; - VkResult res = (*GetVulkanFunctions().vkCreateBuffer)( - m_hDevice, &dummyBufCreateInfo, GetAllocationCallbacks(), &buf); - if(res == VK_SUCCESS) - { - // Query for supported memory types. - VkMemoryRequirements memReq; - (*GetVulkanFunctions().vkGetBufferMemoryRequirements)(m_hDevice, buf, &memReq); - memoryTypeBits = memReq.memoryTypeBits; - - // Destroy buffer. - (*GetVulkanFunctions().vkDestroyBuffer)(m_hDevice, buf, GetAllocationCallbacks()); - } - - return memoryTypeBits; -} - -uint32_t VmaAllocator_T::CalculateGlobalMemoryTypeBits() const -{ - // Make sure memory information is already fetched. - VMA_ASSERT(GetMemoryTypeCount() > 0); - - uint32_t memoryTypeBits = UINT32_MAX; - - if(!m_UseAmdDeviceCoherentMemory) - { - // Exclude memory types that have VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD. 
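        // m_UseAmdDeviceCoherentMemory comes from VMA_ALLOCATOR_CREATE_AMD_DEVICE_COHERENT_MEMORY_BIT
        // at allocator creation, so unless the application opted in, DEVICE_COHERENT_AMD /
        // DEVICE_UNCACHED_AMD memory types are masked out of the global bits here and are
        // never returned by FindMemoryTypeIndex().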
- for(uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - if((m_MemProps.memoryTypes[memTypeIndex].propertyFlags & VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD_COPY) != 0) - { - memoryTypeBits &= ~(1u << memTypeIndex); - } - } - } - - return memoryTypeBits; -} - -bool VmaAllocator_T::GetFlushOrInvalidateRange( - VmaAllocation allocation, - VkDeviceSize offset, VkDeviceSize size, - VkMappedMemoryRange& outRange) const -{ - const uint32_t memTypeIndex = allocation->GetMemoryTypeIndex(); - if(size > 0 && IsMemoryTypeNonCoherent(memTypeIndex)) - { - const VkDeviceSize nonCoherentAtomSize = m_PhysicalDeviceProperties.limits.nonCoherentAtomSize; - const VkDeviceSize allocationSize = allocation->GetSize(); - VMA_ASSERT(offset <= allocationSize); - - outRange.sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE; - outRange.pNext = VMA_NULL; - outRange.memory = allocation->GetMemory(); - - switch(allocation->GetType()) - { - case VmaAllocation_T::ALLOCATION_TYPE_DEDICATED: - outRange.offset = VmaAlignDown(offset, nonCoherentAtomSize); - if(size == VK_WHOLE_SIZE) - { - outRange.size = allocationSize - outRange.offset; - } - else - { - VMA_ASSERT(offset + size <= allocationSize); - outRange.size = VMA_MIN( - VmaAlignUp(size + (offset - outRange.offset), nonCoherentAtomSize), - allocationSize - outRange.offset); - } - break; - case VmaAllocation_T::ALLOCATION_TYPE_BLOCK: - { - // 1. Still within this allocation. - outRange.offset = VmaAlignDown(offset, nonCoherentAtomSize); - if(size == VK_WHOLE_SIZE) - { - size = allocationSize - offset; - } - else - { - VMA_ASSERT(offset + size <= allocationSize); - } - outRange.size = VmaAlignUp(size + (offset - outRange.offset), nonCoherentAtomSize); - - // 2. Adjust to whole block. - const VkDeviceSize allocationOffset = allocation->GetOffset(); - VMA_ASSERT(allocationOffset % nonCoherentAtomSize == 0); - const VkDeviceSize blockSize = allocation->GetBlock()->m_pMetadata->GetSize(); - outRange.offset += allocationOffset; - outRange.size = VMA_MIN(outRange.size, blockSize - outRange.offset); - - break; - } - default: - VMA_ASSERT(0); - } - return true; - } - return false; -} - -#if VMA_MEMORY_BUDGET -void VmaAllocator_T::UpdateVulkanBudget() -{ - VMA_ASSERT(m_UseExtMemoryBudget); - - VkPhysicalDeviceMemoryProperties2KHR memProps = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2_KHR }; - - VkPhysicalDeviceMemoryBudgetPropertiesEXT budgetProps = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT }; - VmaPnextChainPushFront(&memProps, &budgetProps); - - GetVulkanFunctions().vkGetPhysicalDeviceMemoryProperties2KHR(m_PhysicalDevice, &memProps); - - { - VmaMutexLockWrite lockWrite(m_Budget.m_BudgetMutex, m_UseMutex); - - for(uint32_t heapIndex = 0; heapIndex < GetMemoryHeapCount(); ++heapIndex) - { - m_Budget.m_VulkanUsage[heapIndex] = budgetProps.heapUsage[heapIndex]; - m_Budget.m_VulkanBudget[heapIndex] = budgetProps.heapBudget[heapIndex]; - m_Budget.m_BlockBytesAtBudgetFetch[heapIndex] = m_Budget.m_BlockBytes[heapIndex].load(); - - // Some bugged drivers return the budget incorrectly, e.g. 0 or much bigger than heap size. - if(m_Budget.m_VulkanBudget[heapIndex] == 0) - { - m_Budget.m_VulkanBudget[heapIndex] = m_MemProps.memoryHeaps[heapIndex].size * 8 / 10; // 80% heuristics. 
- } - else if(m_Budget.m_VulkanBudget[heapIndex] > m_MemProps.memoryHeaps[heapIndex].size) - { - m_Budget.m_VulkanBudget[heapIndex] = m_MemProps.memoryHeaps[heapIndex].size; - } - if(m_Budget.m_VulkanUsage[heapIndex] == 0 && m_Budget.m_BlockBytesAtBudgetFetch[heapIndex] > 0) - { - m_Budget.m_VulkanUsage[heapIndex] = m_Budget.m_BlockBytesAtBudgetFetch[heapIndex]; - } - } - m_Budget.m_OperationsSinceBudgetFetch = 0; - } -} -#endif // VMA_MEMORY_BUDGET - -void VmaAllocator_T::FillAllocation(const VmaAllocation hAllocation, uint8_t pattern) -{ - if(VMA_DEBUG_INITIALIZE_ALLOCATIONS && - hAllocation->IsMappingAllowed() && - (m_MemProps.memoryTypes[hAllocation->GetMemoryTypeIndex()].propertyFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) != 0) - { - void* pData = VMA_NULL; - VkResult res = Map(hAllocation, &pData); - if(res == VK_SUCCESS) - { - memset(pData, (int)pattern, (size_t)hAllocation->GetSize()); - FlushOrInvalidateAllocation(hAllocation, 0, VK_WHOLE_SIZE, VMA_CACHE_FLUSH); - Unmap(hAllocation); - } - else - { - VMA_ASSERT(0 && "VMA_DEBUG_INITIALIZE_ALLOCATIONS is enabled, but couldn't map memory to fill allocation."); - } - } -} - -uint32_t VmaAllocator_T::GetGpuDefragmentationMemoryTypeBits() -{ - uint32_t memoryTypeBits = m_GpuDefragmentationMemoryTypeBits.load(); - if(memoryTypeBits == UINT32_MAX) - { - memoryTypeBits = CalculateGpuDefragmentationMemoryTypeBits(); - m_GpuDefragmentationMemoryTypeBits.store(memoryTypeBits); - } - return memoryTypeBits; -} - -#if VMA_STATS_STRING_ENABLED -void VmaAllocator_T::PrintDetailedMap(VmaJsonWriter& json) -{ - json.WriteString("DefaultPools"); - json.BeginObject(); - { - for (uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - VmaBlockVector* pBlockVector = m_pBlockVectors[memTypeIndex]; - VmaDedicatedAllocationList& dedicatedAllocList = m_DedicatedAllocations[memTypeIndex]; - if (pBlockVector != VMA_NULL) - { - json.BeginString("Type "); - json.ContinueString(memTypeIndex); - json.EndString(); - json.BeginObject(); - { - json.WriteString("PreferredBlockSize"); - json.WriteNumber(pBlockVector->GetPreferredBlockSize()); - - json.WriteString("Blocks"); - pBlockVector->PrintDetailedMap(json); - - json.WriteString("DedicatedAllocations"); - dedicatedAllocList.BuildStatsString(json); - } - json.EndObject(); - } - } - } - json.EndObject(); - - json.WriteString("CustomPools"); - json.BeginObject(); - { - VmaMutexLockRead lock(m_PoolsMutex, m_UseMutex); - if (!m_Pools.IsEmpty()) - { - for (uint32_t memTypeIndex = 0; memTypeIndex < GetMemoryTypeCount(); ++memTypeIndex) - { - bool displayType = true; - size_t index = 0; - for (VmaPool pool = m_Pools.Front(); pool != VMA_NULL; pool = m_Pools.GetNext(pool)) - { - VmaBlockVector& blockVector = pool->m_BlockVector; - if (blockVector.GetMemoryTypeIndex() == memTypeIndex) - { - if (displayType) - { - json.BeginString("Type "); - json.ContinueString(memTypeIndex); - json.EndString(); - json.BeginArray(); - displayType = false; - } - - json.BeginObject(); - { - json.WriteString("Name"); - json.BeginString(); - json.ContinueString_Size(index++); - if (pool->GetName()) - { - json.ContinueString(" - "); - json.ContinueString(pool->GetName()); - } - json.EndString(); - - json.WriteString("PreferredBlockSize"); - json.WriteNumber(blockVector.GetPreferredBlockSize()); - - json.WriteString("Blocks"); - blockVector.PrintDetailedMap(json); - - json.WriteString("DedicatedAllocations"); - pool->m_DedicatedAllocations.BuildStatsString(json); - } - json.EndObject(); - } - } - - if (!displayType) 
- json.EndArray(); - } - } - } - json.EndObject(); -} -#endif // VMA_STATS_STRING_ENABLED -#endif // _VMA_ALLOCATOR_T_FUNCTIONS - - -#ifndef _VMA_PUBLIC_INTERFACE -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateAllocator( - const VmaAllocatorCreateInfo* pCreateInfo, - VmaAllocator* pAllocator) -{ - VMA_ASSERT(pCreateInfo && pAllocator); - VMA_ASSERT(pCreateInfo->vulkanApiVersion == 0 || - (VK_VERSION_MAJOR(pCreateInfo->vulkanApiVersion) == 1 && VK_VERSION_MINOR(pCreateInfo->vulkanApiVersion) <= 3)); - VMA_DEBUG_LOG("vmaCreateAllocator"); - *pAllocator = vma_new(pCreateInfo->pAllocationCallbacks, VmaAllocator_T)(pCreateInfo); - VkResult result = (*pAllocator)->Init(pCreateInfo); - if(result < 0) - { - vma_delete(pCreateInfo->pAllocationCallbacks, *pAllocator); - *pAllocator = VK_NULL_HANDLE; - } - return result; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyAllocator( - VmaAllocator allocator) -{ - if(allocator != VK_NULL_HANDLE) - { - VMA_DEBUG_LOG("vmaDestroyAllocator"); - VkAllocationCallbacks allocationCallbacks = allocator->m_AllocationCallbacks; // Have to copy the callbacks when destroying. - vma_delete(&allocationCallbacks, allocator); - } -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetAllocatorInfo(VmaAllocator allocator, VmaAllocatorInfo* pAllocatorInfo) -{ - VMA_ASSERT(allocator && pAllocatorInfo); - pAllocatorInfo->instance = allocator->m_hInstance; - pAllocatorInfo->physicalDevice = allocator->GetPhysicalDevice(); - pAllocatorInfo->device = allocator->m_hDevice; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetPhysicalDeviceProperties( - VmaAllocator allocator, - const VkPhysicalDeviceProperties **ppPhysicalDeviceProperties) -{ - VMA_ASSERT(allocator && ppPhysicalDeviceProperties); - *ppPhysicalDeviceProperties = &allocator->m_PhysicalDeviceProperties; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetMemoryProperties( - VmaAllocator allocator, - const VkPhysicalDeviceMemoryProperties** ppPhysicalDeviceMemoryProperties) -{ - VMA_ASSERT(allocator && ppPhysicalDeviceMemoryProperties); - *ppPhysicalDeviceMemoryProperties = &allocator->m_MemProps; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetMemoryTypeProperties( - VmaAllocator allocator, - uint32_t memoryTypeIndex, - VkMemoryPropertyFlags* pFlags) -{ - VMA_ASSERT(allocator && pFlags); - VMA_ASSERT(memoryTypeIndex < allocator->GetMemoryTypeCount()); - *pFlags = allocator->m_MemProps.memoryTypes[memoryTypeIndex].propertyFlags; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaSetCurrentFrameIndex( - VmaAllocator allocator, - uint32_t frameIndex) -{ - VMA_ASSERT(allocator); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->SetCurrentFrameIndex(frameIndex); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaCalculateStatistics( - VmaAllocator allocator, - VmaTotalStatistics* pStats) -{ - VMA_ASSERT(allocator && pStats); - VMA_DEBUG_GLOBAL_MUTEX_LOCK - allocator->CalculateStatistics(pStats); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetHeapBudgets( - VmaAllocator allocator, - VmaBudget* pBudgets) -{ - VMA_ASSERT(allocator && pBudgets); - VMA_DEBUG_GLOBAL_MUTEX_LOCK - allocator->GetHeapBudgets(pBudgets, 0, allocator->GetMemoryHeapCount()); -} - -#if VMA_STATS_STRING_ENABLED - -VMA_CALL_PRE void VMA_CALL_POST vmaBuildStatsString( - VmaAllocator allocator, - char** ppStatsString, - VkBool32 detailedMap) -{ - VMA_ASSERT(allocator && ppStatsString); - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - VmaStringBuilder sb(allocator->GetAllocationCallbacks()); - { - VmaBudget budgets[VK_MAX_MEMORY_HEAPS]; - allocator->GetHeapBudgets(budgets, 0, allocator->GetMemoryHeapCount()); - - 
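A typical call site for the `vmaBuildStatsString()` implementation being removed here, sketched against the VMA 3.x public API; the resulting JSON is what tools such as VMA's VmaDumpVis consume:
```
// Illustrative only -- not part of this patch.
#include <cstdio>

void DumpVmaStats(VmaAllocator allocator, const char* path)
{
    char* statsString = nullptr;
    vmaBuildStatsString(allocator, &statsString, VK_TRUE); // VK_TRUE adds the "DefaultPools"/"CustomPools" detail
    if (std::FILE* f = std::fopen(path, "w"))
    {
        std::fputs(statsString, f);
        std::fclose(f);
    }
    vmaFreeStatsString(allocator, statsString);
}
```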
VmaTotalStatistics stats; - allocator->CalculateStatistics(&stats); - - VmaJsonWriter json(allocator->GetAllocationCallbacks(), sb); - json.BeginObject(); - { - json.WriteString("General"); - json.BeginObject(); - { - const VkPhysicalDeviceProperties& deviceProperties = allocator->m_PhysicalDeviceProperties; - const VkPhysicalDeviceMemoryProperties& memoryProperties = allocator->m_MemProps; - - json.WriteString("API"); - json.WriteString("Vulkan"); - - json.WriteString("apiVersion"); - json.BeginString(); - json.ContinueString(VK_API_VERSION_MAJOR(deviceProperties.apiVersion)); - json.ContinueString("."); - json.ContinueString(VK_API_VERSION_MINOR(deviceProperties.apiVersion)); - json.ContinueString("."); - json.ContinueString(VK_API_VERSION_PATCH(deviceProperties.apiVersion)); - json.EndString(); - - json.WriteString("GPU"); - json.WriteString(deviceProperties.deviceName); - json.WriteString("deviceType"); - json.WriteNumber(static_cast(deviceProperties.deviceType)); - - json.WriteString("maxMemoryAllocationCount"); - json.WriteNumber(deviceProperties.limits.maxMemoryAllocationCount); - json.WriteString("bufferImageGranularity"); - json.WriteNumber(deviceProperties.limits.bufferImageGranularity); - json.WriteString("nonCoherentAtomSize"); - json.WriteNumber(deviceProperties.limits.nonCoherentAtomSize); - - json.WriteString("memoryHeapCount"); - json.WriteNumber(memoryProperties.memoryHeapCount); - json.WriteString("memoryTypeCount"); - json.WriteNumber(memoryProperties.memoryTypeCount); - } - json.EndObject(); - } - { - json.WriteString("Total"); - VmaPrintDetailedStatistics(json, stats.total); - } - { - json.WriteString("MemoryInfo"); - json.BeginObject(); - { - for (uint32_t heapIndex = 0; heapIndex < allocator->GetMemoryHeapCount(); ++heapIndex) - { - json.BeginString("Heap "); - json.ContinueString(heapIndex); - json.EndString(); - json.BeginObject(); - { - const VkMemoryHeap& heapInfo = allocator->m_MemProps.memoryHeaps[heapIndex]; - json.WriteString("Flags"); - json.BeginArray(true); - { - if (heapInfo.flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT) - json.WriteString("DEVICE_LOCAL"); - #if VMA_VULKAN_VERSION >= 1001000 - if (heapInfo.flags & VK_MEMORY_HEAP_MULTI_INSTANCE_BIT) - json.WriteString("MULTI_INSTANCE"); - #endif - - VkMemoryHeapFlags flags = heapInfo.flags & - ~(VK_MEMORY_HEAP_DEVICE_LOCAL_BIT - #if VMA_VULKAN_VERSION >= 1001000 - | VK_MEMORY_HEAP_MULTI_INSTANCE_BIT - #endif - ); - if (flags != 0) - json.WriteNumber(flags); - } - json.EndArray(); - - json.WriteString("Size"); - json.WriteNumber(heapInfo.size); - - json.WriteString("Budget"); - json.BeginObject(); - { - json.WriteString("BudgetBytes"); - json.WriteNumber(budgets[heapIndex].budget); - json.WriteString("UsageBytes"); - json.WriteNumber(budgets[heapIndex].usage); - } - json.EndObject(); - - json.WriteString("Stats"); - VmaPrintDetailedStatistics(json, stats.memoryHeap[heapIndex]); - - json.WriteString("MemoryPools"); - json.BeginObject(); - { - for (uint32_t typeIndex = 0; typeIndex < allocator->GetMemoryTypeCount(); ++typeIndex) - { - if (allocator->MemoryTypeIndexToHeapIndex(typeIndex) == heapIndex) - { - json.BeginString("Type "); - json.ContinueString(typeIndex); - json.EndString(); - json.BeginObject(); - { - json.WriteString("Flags"); - json.BeginArray(true); - { - VkMemoryPropertyFlags flags = allocator->m_MemProps.memoryTypes[typeIndex].propertyFlags; - if (flags & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) - json.WriteString("DEVICE_LOCAL"); - if (flags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) - 
json.WriteString("HOST_VISIBLE"); - if (flags & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) - json.WriteString("HOST_COHERENT"); - if (flags & VK_MEMORY_PROPERTY_HOST_CACHED_BIT) - json.WriteString("HOST_CACHED"); - if (flags & VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT) - json.WriteString("LAZILY_ALLOCATED"); - #if VMA_VULKAN_VERSION >= 1001000 - if (flags & VK_MEMORY_PROPERTY_PROTECTED_BIT) - json.WriteString("PROTECTED"); - #endif - #if VK_AMD_device_coherent_memory - if (flags & VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD_COPY) - json.WriteString("DEVICE_COHERENT_AMD"); - if (flags & VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD_COPY) - json.WriteString("DEVICE_UNCACHED_AMD"); - #endif - - flags &= ~(VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT - #if VMA_VULKAN_VERSION >= 1001000 - | VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT - #endif - #if VK_AMD_device_coherent_memory - | VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD_COPY - | VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD_COPY - #endif - | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT - | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT - | VK_MEMORY_PROPERTY_HOST_CACHED_BIT); - if (flags != 0) - json.WriteNumber(flags); - } - json.EndArray(); - - json.WriteString("Stats"); - VmaPrintDetailedStatistics(json, stats.memoryType[typeIndex]); - } - json.EndObject(); - } - } - - } - json.EndObject(); - } - json.EndObject(); - } - } - json.EndObject(); - } - - if (detailedMap == VK_TRUE) - allocator->PrintDetailedMap(json); - - json.EndObject(); - } - - *ppStatsString = VmaCreateStringCopy(allocator->GetAllocationCallbacks(), sb.GetData(), sb.GetLength()); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaFreeStatsString( - VmaAllocator allocator, - char* pStatsString) -{ - if(pStatsString != VMA_NULL) - { - VMA_ASSERT(allocator); - VmaFreeString(allocator->GetAllocationCallbacks(), pStatsString); - } -} - -#endif // VMA_STATS_STRING_ENABLED - -/* -This function is not protected by any mutex because it just reads immutable data. 
-*/ -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFindMemoryTypeIndex( - VmaAllocator allocator, - uint32_t memoryTypeBits, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - uint32_t* pMemoryTypeIndex) -{ - VMA_ASSERT(allocator != VK_NULL_HANDLE); - VMA_ASSERT(pAllocationCreateInfo != VMA_NULL); - VMA_ASSERT(pMemoryTypeIndex != VMA_NULL); - - return allocator->FindMemoryTypeIndex(memoryTypeBits, pAllocationCreateInfo, UINT32_MAX, pMemoryTypeIndex); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFindMemoryTypeIndexForBufferInfo( - VmaAllocator allocator, - const VkBufferCreateInfo* pBufferCreateInfo, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - uint32_t* pMemoryTypeIndex) -{ - VMA_ASSERT(allocator != VK_NULL_HANDLE); - VMA_ASSERT(pBufferCreateInfo != VMA_NULL); - VMA_ASSERT(pAllocationCreateInfo != VMA_NULL); - VMA_ASSERT(pMemoryTypeIndex != VMA_NULL); - - const VkDevice hDev = allocator->m_hDevice; - const VmaVulkanFunctions* funcs = &allocator->GetVulkanFunctions(); - VkResult res; - -#if VMA_VULKAN_VERSION >= 1003000 - if(funcs->vkGetDeviceBufferMemoryRequirements) - { - // Can query straight from VkBufferCreateInfo :) - VkDeviceBufferMemoryRequirements devBufMemReq = {VK_STRUCTURE_TYPE_DEVICE_BUFFER_MEMORY_REQUIREMENTS}; - devBufMemReq.pCreateInfo = pBufferCreateInfo; - - VkMemoryRequirements2 memReq = {VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2}; - (*funcs->vkGetDeviceBufferMemoryRequirements)(hDev, &devBufMemReq, &memReq); - - res = allocator->FindMemoryTypeIndex( - memReq.memoryRequirements.memoryTypeBits, pAllocationCreateInfo, pBufferCreateInfo->usage, pMemoryTypeIndex); - } - else -#endif // #if VMA_VULKAN_VERSION >= 1003000 - { - // Must create a dummy buffer to query :( - VkBuffer hBuffer = VK_NULL_HANDLE; - res = funcs->vkCreateBuffer( - hDev, pBufferCreateInfo, allocator->GetAllocationCallbacks(), &hBuffer); - if(res == VK_SUCCESS) - { - VkMemoryRequirements memReq = {}; - funcs->vkGetBufferMemoryRequirements(hDev, hBuffer, &memReq); - - res = allocator->FindMemoryTypeIndex( - memReq.memoryTypeBits, pAllocationCreateInfo, pBufferCreateInfo->usage, pMemoryTypeIndex); - - funcs->vkDestroyBuffer( - hDev, hBuffer, allocator->GetAllocationCallbacks()); - } - } - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFindMemoryTypeIndexForImageInfo( - VmaAllocator allocator, - const VkImageCreateInfo* pImageCreateInfo, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - uint32_t* pMemoryTypeIndex) -{ - VMA_ASSERT(allocator != VK_NULL_HANDLE); - VMA_ASSERT(pImageCreateInfo != VMA_NULL); - VMA_ASSERT(pAllocationCreateInfo != VMA_NULL); - VMA_ASSERT(pMemoryTypeIndex != VMA_NULL); - - const VkDevice hDev = allocator->m_hDevice; - const VmaVulkanFunctions* funcs = &allocator->GetVulkanFunctions(); - VkResult res; - -#if VMA_VULKAN_VERSION >= 1003000 - if(funcs->vkGetDeviceImageMemoryRequirements) - { - // Can query straight from VkImageCreateInfo :) - VkDeviceImageMemoryRequirements devImgMemReq = {VK_STRUCTURE_TYPE_DEVICE_IMAGE_MEMORY_REQUIREMENTS}; - devImgMemReq.pCreateInfo = pImageCreateInfo; - VMA_ASSERT(pImageCreateInfo->tiling != VK_IMAGE_TILING_DRM_FORMAT_MODIFIER_EXT_COPY && (pImageCreateInfo->flags & VK_IMAGE_CREATE_DISJOINT_BIT_COPY) == 0 && - "Cannot use this VkImageCreateInfo with vmaFindMemoryTypeIndexForImageInfo as I don't know what to pass as VkDeviceImageMemoryRequirements::planeAspect."); - - VkMemoryRequirements2 memReq = {VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2}; - (*funcs->vkGetDeviceImageMemoryRequirements)(hDev, &devImgMemReq, &memReq); - - res = 
allocator->FindMemoryTypeIndex( - memReq.memoryRequirements.memoryTypeBits, pAllocationCreateInfo, pImageCreateInfo->usage, pMemoryTypeIndex); - } - else -#endif // #if VMA_VULKAN_VERSION >= 1003000 - { - // Must create a dummy image to query :( - VkImage hImage = VK_NULL_HANDLE; - res = funcs->vkCreateImage( - hDev, pImageCreateInfo, allocator->GetAllocationCallbacks(), &hImage); - if(res == VK_SUCCESS) - { - VkMemoryRequirements memReq = {}; - funcs->vkGetImageMemoryRequirements(hDev, hImage, &memReq); - - res = allocator->FindMemoryTypeIndex( - memReq.memoryTypeBits, pAllocationCreateInfo, pImageCreateInfo->usage, pMemoryTypeIndex); - - funcs->vkDestroyImage( - hDev, hImage, allocator->GetAllocationCallbacks()); - } - } - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreatePool( - VmaAllocator allocator, - const VmaPoolCreateInfo* pCreateInfo, - VmaPool* pPool) -{ - VMA_ASSERT(allocator && pCreateInfo && pPool); - - VMA_DEBUG_LOG("vmaCreatePool"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return allocator->CreatePool(pCreateInfo, pPool); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyPool( - VmaAllocator allocator, - VmaPool pool) -{ - VMA_ASSERT(allocator); - - if(pool == VK_NULL_HANDLE) - { - return; - } - - VMA_DEBUG_LOG("vmaDestroyPool"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->DestroyPool(pool); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetPoolStatistics( - VmaAllocator allocator, - VmaPool pool, - VmaStatistics* pPoolStats) -{ - VMA_ASSERT(allocator && pool && pPoolStats); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->GetPoolStatistics(pool, pPoolStats); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaCalculatePoolStatistics( - VmaAllocator allocator, - VmaPool pool, - VmaDetailedStatistics* pPoolStats) -{ - VMA_ASSERT(allocator && pool && pPoolStats); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->CalculatePoolStatistics(pool, pPoolStats); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCheckPoolCorruption(VmaAllocator allocator, VmaPool pool) -{ - VMA_ASSERT(allocator && pool); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - VMA_DEBUG_LOG("vmaCheckPoolCorruption"); - - return allocator->CheckPoolCorruption(pool); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetPoolName( - VmaAllocator allocator, - VmaPool pool, - const char** ppName) -{ - VMA_ASSERT(allocator && pool && ppName); - - VMA_DEBUG_LOG("vmaGetPoolName"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - *ppName = pool->GetName(); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaSetPoolName( - VmaAllocator allocator, - VmaPool pool, - const char* pName) -{ - VMA_ASSERT(allocator && pool); - - VMA_DEBUG_LOG("vmaSetPoolName"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - pool->SetName(pName); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemory( - VmaAllocator allocator, - const VkMemoryRequirements* pVkMemoryRequirements, - const VmaAllocationCreateInfo* pCreateInfo, - VmaAllocation* pAllocation, - VmaAllocationInfo* pAllocationInfo) -{ - VMA_ASSERT(allocator && pVkMemoryRequirements && pCreateInfo && pAllocation); - - VMA_DEBUG_LOG("vmaAllocateMemory"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - VkResult result = allocator->AllocateMemory( - *pVkMemoryRequirements, - false, // requiresDedicatedAllocation - false, // prefersDedicatedAllocation - VK_NULL_HANDLE, // dedicatedBuffer - VK_NULL_HANDLE, // dedicatedImage - UINT32_MAX, // dedicatedBufferImageUsage - *pCreateInfo, - VMA_SUBALLOCATION_TYPE_UNKNOWN, - 1, // allocationCount - pAllocation); - - if(pAllocationInfo != VMA_NULL && result == VK_SUCCESS) - { - allocator->GetAllocationInfo(*pAllocation, 
pAllocationInfo); - } - - return result; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemoryPages( - VmaAllocator allocator, - const VkMemoryRequirements* pVkMemoryRequirements, - const VmaAllocationCreateInfo* pCreateInfo, - size_t allocationCount, - VmaAllocation* pAllocations, - VmaAllocationInfo* pAllocationInfo) -{ - if(allocationCount == 0) - { - return VK_SUCCESS; - } - - VMA_ASSERT(allocator && pVkMemoryRequirements && pCreateInfo && pAllocations); - - VMA_DEBUG_LOG("vmaAllocateMemoryPages"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - VkResult result = allocator->AllocateMemory( - *pVkMemoryRequirements, - false, // requiresDedicatedAllocation - false, // prefersDedicatedAllocation - VK_NULL_HANDLE, // dedicatedBuffer - VK_NULL_HANDLE, // dedicatedImage - UINT32_MAX, // dedicatedBufferImageUsage - *pCreateInfo, - VMA_SUBALLOCATION_TYPE_UNKNOWN, - allocationCount, - pAllocations); - - if(pAllocationInfo != VMA_NULL && result == VK_SUCCESS) - { - for(size_t i = 0; i < allocationCount; ++i) - { - allocator->GetAllocationInfo(pAllocations[i], pAllocationInfo + i); - } - } - - return result; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemoryForBuffer( - VmaAllocator allocator, - VkBuffer buffer, - const VmaAllocationCreateInfo* pCreateInfo, - VmaAllocation* pAllocation, - VmaAllocationInfo* pAllocationInfo) -{ - VMA_ASSERT(allocator && buffer != VK_NULL_HANDLE && pCreateInfo && pAllocation); - - VMA_DEBUG_LOG("vmaAllocateMemoryForBuffer"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - VkMemoryRequirements vkMemReq = {}; - bool requiresDedicatedAllocation = false; - bool prefersDedicatedAllocation = false; - allocator->GetBufferMemoryRequirements(buffer, vkMemReq, - requiresDedicatedAllocation, - prefersDedicatedAllocation); - - VkResult result = allocator->AllocateMemory( - vkMemReq, - requiresDedicatedAllocation, - prefersDedicatedAllocation, - buffer, // dedicatedBuffer - VK_NULL_HANDLE, // dedicatedImage - UINT32_MAX, // dedicatedBufferImageUsage - *pCreateInfo, - VMA_SUBALLOCATION_TYPE_BUFFER, - 1, // allocationCount - pAllocation); - - if(pAllocationInfo && result == VK_SUCCESS) - { - allocator->GetAllocationInfo(*pAllocation, pAllocationInfo); - } - - return result; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaAllocateMemoryForImage( - VmaAllocator allocator, - VkImage image, - const VmaAllocationCreateInfo* pCreateInfo, - VmaAllocation* pAllocation, - VmaAllocationInfo* pAllocationInfo) -{ - VMA_ASSERT(allocator && image != VK_NULL_HANDLE && pCreateInfo && pAllocation); - - VMA_DEBUG_LOG("vmaAllocateMemoryForImage"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - VkMemoryRequirements vkMemReq = {}; - bool requiresDedicatedAllocation = false; - bool prefersDedicatedAllocation = false; - allocator->GetImageMemoryRequirements(image, vkMemReq, - requiresDedicatedAllocation, prefersDedicatedAllocation); - - VkResult result = allocator->AllocateMemory( - vkMemReq, - requiresDedicatedAllocation, - prefersDedicatedAllocation, - VK_NULL_HANDLE, // dedicatedBuffer - image, // dedicatedImage - UINT32_MAX, // dedicatedBufferImageUsage - *pCreateInfo, - VMA_SUBALLOCATION_TYPE_IMAGE_UNKNOWN, - 1, // allocationCount - pAllocation); - - if(pAllocationInfo && result == VK_SUCCESS) - { - allocator->GetAllocationInfo(*pAllocation, pAllocationInfo); - } - - return result; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaFreeMemory( - VmaAllocator allocator, - VmaAllocation allocation) -{ - VMA_ASSERT(allocator); - - if(allocation == VK_NULL_HANDLE) - { - return; - } - - VMA_DEBUG_LOG("vmaFreeMemory"); - - 
VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->FreeMemory( - 1, // allocationCount - &allocation); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaFreeMemoryPages( - VmaAllocator allocator, - size_t allocationCount, - const VmaAllocation* pAllocations) -{ - if(allocationCount == 0) - { - return; - } - - VMA_ASSERT(allocator); - - VMA_DEBUG_LOG("vmaFreeMemoryPages"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->FreeMemory(allocationCount, pAllocations); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetAllocationInfo( - VmaAllocator allocator, - VmaAllocation allocation, - VmaAllocationInfo* pAllocationInfo) -{ - VMA_ASSERT(allocator && allocation && pAllocationInfo); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->GetAllocationInfo(allocation, pAllocationInfo); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaSetAllocationUserData( - VmaAllocator allocator, - VmaAllocation allocation, - void* pUserData) -{ - VMA_ASSERT(allocator && allocation); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocation->SetUserData(allocator, pUserData); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaSetAllocationName( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - const char* VMA_NULLABLE pName) -{ - allocation->SetName(allocator, pName); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetAllocationMemoryProperties( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - VkMemoryPropertyFlags* VMA_NOT_NULL pFlags) -{ - VMA_ASSERT(allocator && allocation && pFlags); - const uint32_t memTypeIndex = allocation->GetMemoryTypeIndex(); - *pFlags = allocator->m_MemProps.memoryTypes[memTypeIndex].propertyFlags; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaMapMemory( - VmaAllocator allocator, - VmaAllocation allocation, - void** ppData) -{ - VMA_ASSERT(allocator && allocation && ppData); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return allocator->Map(allocation, ppData); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaUnmapMemory( - VmaAllocator allocator, - VmaAllocation allocation) -{ - VMA_ASSERT(allocator && allocation); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - allocator->Unmap(allocation); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFlushAllocation( - VmaAllocator allocator, - VmaAllocation allocation, - VkDeviceSize offset, - VkDeviceSize size) -{ - VMA_ASSERT(allocator && allocation); - - VMA_DEBUG_LOG("vmaFlushAllocation"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - const VkResult res = allocator->FlushOrInvalidateAllocation(allocation, offset, size, VMA_CACHE_FLUSH); - - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaInvalidateAllocation( - VmaAllocator allocator, - VmaAllocation allocation, - VkDeviceSize offset, - VkDeviceSize size) -{ - VMA_ASSERT(allocator && allocation); - - VMA_DEBUG_LOG("vmaInvalidateAllocation"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - const VkResult res = allocator->FlushOrInvalidateAllocation(allocation, offset, size, VMA_CACHE_INVALIDATE); - - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaFlushAllocations( - VmaAllocator allocator, - uint32_t allocationCount, - const VmaAllocation* allocations, - const VkDeviceSize* offsets, - const VkDeviceSize* sizes) -{ - VMA_ASSERT(allocator); - - if(allocationCount == 0) - { - return VK_SUCCESS; - } - - VMA_ASSERT(allocations); - - VMA_DEBUG_LOG("vmaFlushAllocations"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - const VkResult res = allocator->FlushOrInvalidateAllocations(allocationCount, allocations, offsets, sizes, VMA_CACHE_FLUSH); - - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaInvalidateAllocations( - VmaAllocator allocator, 
- uint32_t allocationCount, - const VmaAllocation* allocations, - const VkDeviceSize* offsets, - const VkDeviceSize* sizes) -{ - VMA_ASSERT(allocator); - - if(allocationCount == 0) - { - return VK_SUCCESS; - } - - VMA_ASSERT(allocations); - - VMA_DEBUG_LOG("vmaInvalidateAllocations"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - const VkResult res = allocator->FlushOrInvalidateAllocations(allocationCount, allocations, offsets, sizes, VMA_CACHE_INVALIDATE); - - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCheckCorruption( - VmaAllocator allocator, - uint32_t memoryTypeBits) -{ - VMA_ASSERT(allocator); - - VMA_DEBUG_LOG("vmaCheckCorruption"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return allocator->CheckCorruption(memoryTypeBits); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBeginDefragmentation( - VmaAllocator allocator, - const VmaDefragmentationInfo* pInfo, - VmaDefragmentationContext* pContext) -{ - VMA_ASSERT(allocator && pInfo && pContext); - - VMA_DEBUG_LOG("vmaBeginDefragmentation"); - - if (pInfo->pool != VMA_NULL) - { - // Check if run on supported algorithms - if (pInfo->pool->m_BlockVector.GetAlgorithm() & VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT) - return VK_ERROR_FEATURE_NOT_PRESENT; - } - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - *pContext = vma_new(allocator, VmaDefragmentationContext_T)(allocator, *pInfo); - return VK_SUCCESS; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaEndDefragmentation( - VmaAllocator allocator, - VmaDefragmentationContext context, - VmaDefragmentationStats* pStats) -{ - VMA_ASSERT(allocator && context); - - VMA_DEBUG_LOG("vmaEndDefragmentation"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - if (pStats) - context->GetStats(*pStats); - vma_delete(allocator, context); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBeginDefragmentationPass( - VmaAllocator VMA_NOT_NULL allocator, - VmaDefragmentationContext VMA_NOT_NULL context, - VmaDefragmentationPassMoveInfo* VMA_NOT_NULL pPassInfo) -{ - VMA_ASSERT(context && pPassInfo); - - VMA_DEBUG_LOG("vmaBeginDefragmentationPass"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return context->DefragmentPassBegin(*pPassInfo); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaEndDefragmentationPass( - VmaAllocator VMA_NOT_NULL allocator, - VmaDefragmentationContext VMA_NOT_NULL context, - VmaDefragmentationPassMoveInfo* VMA_NOT_NULL pPassInfo) -{ - VMA_ASSERT(context && pPassInfo); - - VMA_DEBUG_LOG("vmaEndDefragmentationPass"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return context->DefragmentPassEnd(*pPassInfo); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindBufferMemory( - VmaAllocator allocator, - VmaAllocation allocation, - VkBuffer buffer) -{ - VMA_ASSERT(allocator && allocation && buffer); - - VMA_DEBUG_LOG("vmaBindBufferMemory"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return allocator->BindBufferMemory(allocation, 0, buffer, VMA_NULL); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindBufferMemory2( - VmaAllocator allocator, - VmaAllocation allocation, - VkDeviceSize allocationLocalOffset, - VkBuffer buffer, - const void* pNext) -{ - VMA_ASSERT(allocator && allocation && buffer); - - VMA_DEBUG_LOG("vmaBindBufferMemory2"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return allocator->BindBufferMemory(allocation, allocationLocalOffset, buffer, pNext); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindImageMemory( - VmaAllocator allocator, - VmaAllocation allocation, - VkImage image) -{ - VMA_ASSERT(allocator && allocation && image); - - VMA_DEBUG_LOG("vmaBindImageMemory"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return allocator->BindImageMemory(allocation, 0, image, 
VMA_NULL); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaBindImageMemory2( - VmaAllocator allocator, - VmaAllocation allocation, - VkDeviceSize allocationLocalOffset, - VkImage image, - const void* pNext) -{ - VMA_ASSERT(allocator && allocation && image); - - VMA_DEBUG_LOG("vmaBindImageMemory2"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - return allocator->BindImageMemory(allocation, allocationLocalOffset, image, pNext); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateBuffer( - VmaAllocator allocator, - const VkBufferCreateInfo* pBufferCreateInfo, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - VkBuffer* pBuffer, - VmaAllocation* pAllocation, - VmaAllocationInfo* pAllocationInfo) -{ - VMA_ASSERT(allocator && pBufferCreateInfo && pAllocationCreateInfo && pBuffer && pAllocation); - - if(pBufferCreateInfo->size == 0) - { - return VK_ERROR_INITIALIZATION_FAILED; - } - if((pBufferCreateInfo->usage & VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_COPY) != 0 && - !allocator->m_UseKhrBufferDeviceAddress) - { - VMA_ASSERT(0 && "Creating a buffer with VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT is not valid if VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT was not used."); - return VK_ERROR_INITIALIZATION_FAILED; - } - - VMA_DEBUG_LOG("vmaCreateBuffer"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - *pBuffer = VK_NULL_HANDLE; - *pAllocation = VK_NULL_HANDLE; - - // 1. Create VkBuffer. - VkResult res = (*allocator->GetVulkanFunctions().vkCreateBuffer)( - allocator->m_hDevice, - pBufferCreateInfo, - allocator->GetAllocationCallbacks(), - pBuffer); - if(res >= 0) - { - // 2. vkGetBufferMemoryRequirements. - VkMemoryRequirements vkMemReq = {}; - bool requiresDedicatedAllocation = false; - bool prefersDedicatedAllocation = false; - allocator->GetBufferMemoryRequirements(*pBuffer, vkMemReq, - requiresDedicatedAllocation, prefersDedicatedAllocation); - - // 3. Allocate memory using allocator. - res = allocator->AllocateMemory( - vkMemReq, - requiresDedicatedAllocation, - prefersDedicatedAllocation, - *pBuffer, // dedicatedBuffer - VK_NULL_HANDLE, // dedicatedImage - pBufferCreateInfo->usage, // dedicatedBufferImageUsage - *pAllocationCreateInfo, - VMA_SUBALLOCATION_TYPE_BUFFER, - 1, // allocationCount - pAllocation); - - if(res >= 0) - { - // 3. Bind buffer with memory. - if((pAllocationCreateInfo->flags & VMA_ALLOCATION_CREATE_DONT_BIND_BIT) == 0) - { - res = allocator->BindBufferMemory(*pAllocation, 0, *pBuffer, VMA_NULL); - } - if(res >= 0) - { - // All steps succeeded. 
- #if VMA_STATS_STRING_ENABLED - (*pAllocation)->InitBufferImageUsage(pBufferCreateInfo->usage); - #endif - if(pAllocationInfo != VMA_NULL) - { - allocator->GetAllocationInfo(*pAllocation, pAllocationInfo); - } - - return VK_SUCCESS; - } - allocator->FreeMemory( - 1, // allocationCount - pAllocation); - *pAllocation = VK_NULL_HANDLE; - (*allocator->GetVulkanFunctions().vkDestroyBuffer)(allocator->m_hDevice, *pBuffer, allocator->GetAllocationCallbacks()); - *pBuffer = VK_NULL_HANDLE; - return res; - } - (*allocator->GetVulkanFunctions().vkDestroyBuffer)(allocator->m_hDevice, *pBuffer, allocator->GetAllocationCallbacks()); - *pBuffer = VK_NULL_HANDLE; - return res; - } - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateBufferWithAlignment( - VmaAllocator allocator, - const VkBufferCreateInfo* pBufferCreateInfo, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - VkDeviceSize minAlignment, - VkBuffer* pBuffer, - VmaAllocation* pAllocation, - VmaAllocationInfo* pAllocationInfo) -{ - VMA_ASSERT(allocator && pBufferCreateInfo && pAllocationCreateInfo && VmaIsPow2(minAlignment) && pBuffer && pAllocation); - - if(pBufferCreateInfo->size == 0) - { - return VK_ERROR_INITIALIZATION_FAILED; - } - if((pBufferCreateInfo->usage & VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_COPY) != 0 && - !allocator->m_UseKhrBufferDeviceAddress) - { - VMA_ASSERT(0 && "Creating a buffer with VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT is not valid if VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT was not used."); - return VK_ERROR_INITIALIZATION_FAILED; - } - - VMA_DEBUG_LOG("vmaCreateBufferWithAlignment"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - *pBuffer = VK_NULL_HANDLE; - *pAllocation = VK_NULL_HANDLE; - - // 1. Create VkBuffer. - VkResult res = (*allocator->GetVulkanFunctions().vkCreateBuffer)( - allocator->m_hDevice, - pBufferCreateInfo, - allocator->GetAllocationCallbacks(), - pBuffer); - if(res >= 0) - { - // 2. vkGetBufferMemoryRequirements. - VkMemoryRequirements vkMemReq = {}; - bool requiresDedicatedAllocation = false; - bool prefersDedicatedAllocation = false; - allocator->GetBufferMemoryRequirements(*pBuffer, vkMemReq, - requiresDedicatedAllocation, prefersDedicatedAllocation); - - // 2a. Include minAlignment - vkMemReq.alignment = VMA_MAX(vkMemReq.alignment, minAlignment); - - // 3. Allocate memory using allocator. - res = allocator->AllocateMemory( - vkMemReq, - requiresDedicatedAllocation, - prefersDedicatedAllocation, - *pBuffer, // dedicatedBuffer - VK_NULL_HANDLE, // dedicatedImage - pBufferCreateInfo->usage, // dedicatedBufferImageUsage - *pAllocationCreateInfo, - VMA_SUBALLOCATION_TYPE_BUFFER, - 1, // allocationCount - pAllocation); - - if(res >= 0) - { - // 3. Bind buffer with memory. - if((pAllocationCreateInfo->flags & VMA_ALLOCATION_CREATE_DONT_BIND_BIT) == 0) - { - res = allocator->BindBufferMemory(*pAllocation, 0, *pBuffer, VMA_NULL); - } - if(res >= 0) - { - // All steps succeeded. 
- #if VMA_STATS_STRING_ENABLED - (*pAllocation)->InitBufferImageUsage(pBufferCreateInfo->usage); - #endif - if(pAllocationInfo != VMA_NULL) - { - allocator->GetAllocationInfo(*pAllocation, pAllocationInfo); - } - - return VK_SUCCESS; - } - allocator->FreeMemory( - 1, // allocationCount - pAllocation); - *pAllocation = VK_NULL_HANDLE; - (*allocator->GetVulkanFunctions().vkDestroyBuffer)(allocator->m_hDevice, *pBuffer, allocator->GetAllocationCallbacks()); - *pBuffer = VK_NULL_HANDLE; - return res; - } - (*allocator->GetVulkanFunctions().vkDestroyBuffer)(allocator->m_hDevice, *pBuffer, allocator->GetAllocationCallbacks()); - *pBuffer = VK_NULL_HANDLE; - return res; - } - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateAliasingBuffer( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - const VkBufferCreateInfo* VMA_NOT_NULL pBufferCreateInfo, - VkBuffer VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pBuffer) -{ - VMA_ASSERT(allocator && pBufferCreateInfo && pBuffer && allocation); - - VMA_DEBUG_LOG("vmaCreateAliasingBuffer"); - - *pBuffer = VK_NULL_HANDLE; - - if (pBufferCreateInfo->size == 0) - { - return VK_ERROR_INITIALIZATION_FAILED; - } - if ((pBufferCreateInfo->usage & VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_COPY) != 0 && - !allocator->m_UseKhrBufferDeviceAddress) - { - VMA_ASSERT(0 && "Creating a buffer with VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT is not valid if VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT was not used."); - return VK_ERROR_INITIALIZATION_FAILED; - } - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - // 1. Create VkBuffer. - VkResult res = (*allocator->GetVulkanFunctions().vkCreateBuffer)( - allocator->m_hDevice, - pBufferCreateInfo, - allocator->GetAllocationCallbacks(), - pBuffer); - if (res >= 0) - { - // 2. Bind buffer with memory. - res = allocator->BindBufferMemory(allocation, 0, *pBuffer, VMA_NULL); - if (res >= 0) - { - return VK_SUCCESS; - } - (*allocator->GetVulkanFunctions().vkDestroyBuffer)(allocator->m_hDevice, *pBuffer, allocator->GetAllocationCallbacks()); - } - return res; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyBuffer( - VmaAllocator allocator, - VkBuffer buffer, - VmaAllocation allocation) -{ - VMA_ASSERT(allocator); - - if(buffer == VK_NULL_HANDLE && allocation == VK_NULL_HANDLE) - { - return; - } - - VMA_DEBUG_LOG("vmaDestroyBuffer"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - if(buffer != VK_NULL_HANDLE) - { - (*allocator->GetVulkanFunctions().vkDestroyBuffer)(allocator->m_hDevice, buffer, allocator->GetAllocationCallbacks()); - } - - if(allocation != VK_NULL_HANDLE) - { - allocator->FreeMemory( - 1, // allocationCount - &allocation); - } -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateImage( - VmaAllocator allocator, - const VkImageCreateInfo* pImageCreateInfo, - const VmaAllocationCreateInfo* pAllocationCreateInfo, - VkImage* pImage, - VmaAllocation* pAllocation, - VmaAllocationInfo* pAllocationInfo) -{ - VMA_ASSERT(allocator && pImageCreateInfo && pAllocationCreateInfo && pImage && pAllocation); - - if(pImageCreateInfo->extent.width == 0 || - pImageCreateInfo->extent.height == 0 || - pImageCreateInfo->extent.depth == 0 || - pImageCreateInfo->mipLevels == 0 || - pImageCreateInfo->arrayLayers == 0) - { - return VK_ERROR_INITIALIZATION_FAILED; - } - - VMA_DEBUG_LOG("vmaCreateImage"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - *pImage = VK_NULL_HANDLE; - *pAllocation = VK_NULL_HANDLE; - - // 1. Create VkImage. 
- VkResult res = (*allocator->GetVulkanFunctions().vkCreateImage)( - allocator->m_hDevice, - pImageCreateInfo, - allocator->GetAllocationCallbacks(), - pImage); - if(res >= 0) - { - VmaSuballocationType suballocType = pImageCreateInfo->tiling == VK_IMAGE_TILING_OPTIMAL ? - VMA_SUBALLOCATION_TYPE_IMAGE_OPTIMAL : - VMA_SUBALLOCATION_TYPE_IMAGE_LINEAR; - - // 2. Allocate memory using allocator. - VkMemoryRequirements vkMemReq = {}; - bool requiresDedicatedAllocation = false; - bool prefersDedicatedAllocation = false; - allocator->GetImageMemoryRequirements(*pImage, vkMemReq, - requiresDedicatedAllocation, prefersDedicatedAllocation); - - res = allocator->AllocateMemory( - vkMemReq, - requiresDedicatedAllocation, - prefersDedicatedAllocation, - VK_NULL_HANDLE, // dedicatedBuffer - *pImage, // dedicatedImage - pImageCreateInfo->usage, // dedicatedBufferImageUsage - *pAllocationCreateInfo, - suballocType, - 1, // allocationCount - pAllocation); - - if(res >= 0) - { - // 3. Bind image with memory. - if((pAllocationCreateInfo->flags & VMA_ALLOCATION_CREATE_DONT_BIND_BIT) == 0) - { - res = allocator->BindImageMemory(*pAllocation, 0, *pImage, VMA_NULL); - } - if(res >= 0) - { - // All steps succeeded. - #if VMA_STATS_STRING_ENABLED - (*pAllocation)->InitBufferImageUsage(pImageCreateInfo->usage); - #endif - if(pAllocationInfo != VMA_NULL) - { - allocator->GetAllocationInfo(*pAllocation, pAllocationInfo); - } - - return VK_SUCCESS; - } - allocator->FreeMemory( - 1, // allocationCount - pAllocation); - *pAllocation = VK_NULL_HANDLE; - (*allocator->GetVulkanFunctions().vkDestroyImage)(allocator->m_hDevice, *pImage, allocator->GetAllocationCallbacks()); - *pImage = VK_NULL_HANDLE; - return res; - } - (*allocator->GetVulkanFunctions().vkDestroyImage)(allocator->m_hDevice, *pImage, allocator->GetAllocationCallbacks()); - *pImage = VK_NULL_HANDLE; - return res; - } - return res; -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateAliasingImage( - VmaAllocator VMA_NOT_NULL allocator, - VmaAllocation VMA_NOT_NULL allocation, - const VkImageCreateInfo* VMA_NOT_NULL pImageCreateInfo, - VkImage VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pImage) -{ - VMA_ASSERT(allocator && pImageCreateInfo && pImage && allocation); - - *pImage = VK_NULL_HANDLE; - - VMA_DEBUG_LOG("vmaCreateImage"); - - if (pImageCreateInfo->extent.width == 0 || - pImageCreateInfo->extent.height == 0 || - pImageCreateInfo->extent.depth == 0 || - pImageCreateInfo->mipLevels == 0 || - pImageCreateInfo->arrayLayers == 0) - { - return VK_ERROR_INITIALIZATION_FAILED; - } - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - // 1. Create VkImage. - VkResult res = (*allocator->GetVulkanFunctions().vkCreateImage)( - allocator->m_hDevice, - pImageCreateInfo, - allocator->GetAllocationCallbacks(), - pImage); - if (res >= 0) - { - // 2. Bind image with memory. 
- res = allocator->BindImageMemory(allocation, 0, *pImage, VMA_NULL); - if (res >= 0) - { - return VK_SUCCESS; - } - (*allocator->GetVulkanFunctions().vkDestroyImage)(allocator->m_hDevice, *pImage, allocator->GetAllocationCallbacks()); - } - return res; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyImage( - VmaAllocator VMA_NOT_NULL allocator, - VkImage VMA_NULLABLE_NON_DISPATCHABLE image, - VmaAllocation VMA_NULLABLE allocation) -{ - VMA_ASSERT(allocator); - - if(image == VK_NULL_HANDLE && allocation == VK_NULL_HANDLE) - { - return; - } - - VMA_DEBUG_LOG("vmaDestroyImage"); - - VMA_DEBUG_GLOBAL_MUTEX_LOCK - - if(image != VK_NULL_HANDLE) - { - (*allocator->GetVulkanFunctions().vkDestroyImage)(allocator->m_hDevice, image, allocator->GetAllocationCallbacks()); - } - if(allocation != VK_NULL_HANDLE) - { - allocator->FreeMemory( - 1, // allocationCount - &allocation); - } -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaCreateVirtualBlock( - const VmaVirtualBlockCreateInfo* VMA_NOT_NULL pCreateInfo, - VmaVirtualBlock VMA_NULLABLE * VMA_NOT_NULL pVirtualBlock) -{ - VMA_ASSERT(pCreateInfo && pVirtualBlock); - VMA_ASSERT(pCreateInfo->size > 0); - VMA_DEBUG_LOG("vmaCreateVirtualBlock"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - *pVirtualBlock = vma_new(pCreateInfo->pAllocationCallbacks, VmaVirtualBlock_T)(*pCreateInfo); - VkResult res = (*pVirtualBlock)->Init(); - if(res < 0) - { - vma_delete(pCreateInfo->pAllocationCallbacks, *pVirtualBlock); - *pVirtualBlock = VK_NULL_HANDLE; - } - return res; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaDestroyVirtualBlock(VmaVirtualBlock VMA_NULLABLE virtualBlock) -{ - if(virtualBlock != VK_NULL_HANDLE) - { - VMA_DEBUG_LOG("vmaDestroyVirtualBlock"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - VkAllocationCallbacks allocationCallbacks = virtualBlock->m_AllocationCallbacks; // Have to copy the callbacks when destroying. - vma_delete(&allocationCallbacks, virtualBlock); - } -} - -VMA_CALL_PRE VkBool32 VMA_CALL_POST vmaIsVirtualBlockEmpty(VmaVirtualBlock VMA_NOT_NULL virtualBlock) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE); - VMA_DEBUG_LOG("vmaIsVirtualBlockEmpty"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - return virtualBlock->IsEmpty() ? 
VK_TRUE : VK_FALSE; -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetVirtualAllocationInfo(VmaVirtualBlock VMA_NOT_NULL virtualBlock, - VmaVirtualAllocation VMA_NOT_NULL_NON_DISPATCHABLE allocation, VmaVirtualAllocationInfo* VMA_NOT_NULL pVirtualAllocInfo) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE && pVirtualAllocInfo != VMA_NULL); - VMA_DEBUG_LOG("vmaGetVirtualAllocationInfo"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - virtualBlock->GetAllocationInfo(allocation, *pVirtualAllocInfo); -} - -VMA_CALL_PRE VkResult VMA_CALL_POST vmaVirtualAllocate(VmaVirtualBlock VMA_NOT_NULL virtualBlock, - const VmaVirtualAllocationCreateInfo* VMA_NOT_NULL pCreateInfo, VmaVirtualAllocation VMA_NULLABLE_NON_DISPATCHABLE* VMA_NOT_NULL pAllocation, - VkDeviceSize* VMA_NULLABLE pOffset) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE && pCreateInfo != VMA_NULL && pAllocation != VMA_NULL); - VMA_DEBUG_LOG("vmaVirtualAllocate"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - return virtualBlock->Allocate(*pCreateInfo, *pAllocation, pOffset); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaVirtualFree(VmaVirtualBlock VMA_NOT_NULL virtualBlock, VmaVirtualAllocation VMA_NULLABLE_NON_DISPATCHABLE allocation) -{ - if(allocation != VK_NULL_HANDLE) - { - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE); - VMA_DEBUG_LOG("vmaVirtualFree"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - virtualBlock->Free(allocation); - } -} - -VMA_CALL_PRE void VMA_CALL_POST vmaClearVirtualBlock(VmaVirtualBlock VMA_NOT_NULL virtualBlock) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE); - VMA_DEBUG_LOG("vmaClearVirtualBlock"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - virtualBlock->Clear(); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaSetVirtualAllocationUserData(VmaVirtualBlock VMA_NOT_NULL virtualBlock, - VmaVirtualAllocation VMA_NOT_NULL_NON_DISPATCHABLE allocation, void* VMA_NULLABLE pUserData) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE); - VMA_DEBUG_LOG("vmaSetVirtualAllocationUserData"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - virtualBlock->SetAllocationUserData(allocation, pUserData); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaGetVirtualBlockStatistics(VmaVirtualBlock VMA_NOT_NULL virtualBlock, - VmaStatistics* VMA_NOT_NULL pStats) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE && pStats != VMA_NULL); - VMA_DEBUG_LOG("vmaGetVirtualBlockStatistics"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - virtualBlock->GetStatistics(*pStats); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaCalculateVirtualBlockStatistics(VmaVirtualBlock VMA_NOT_NULL virtualBlock, - VmaDetailedStatistics* VMA_NOT_NULL pStats) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE && pStats != VMA_NULL); - VMA_DEBUG_LOG("vmaCalculateVirtualBlockStatistics"); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - virtualBlock->CalculateDetailedStatistics(*pStats); -} - -#if VMA_STATS_STRING_ENABLED - -VMA_CALL_PRE void VMA_CALL_POST vmaBuildVirtualBlockStatsString(VmaVirtualBlock VMA_NOT_NULL virtualBlock, - char* VMA_NULLABLE * VMA_NOT_NULL ppStatsString, VkBool32 detailedMap) -{ - VMA_ASSERT(virtualBlock != VK_NULL_HANDLE && ppStatsString != VMA_NULL); - VMA_DEBUG_GLOBAL_MUTEX_LOCK; - const VkAllocationCallbacks* allocationCallbacks = virtualBlock->GetAllocationCallbacks(); - VmaStringBuilder sb(allocationCallbacks); - virtualBlock->BuildStatsString(detailedMap != VK_FALSE, sb); - *ppStatsString = VmaCreateStringCopy(allocationCallbacks, sb.GetData(), sb.GetLength()); -} - -VMA_CALL_PRE void VMA_CALL_POST vmaFreeVirtualBlockStatsString(VmaVirtualBlock VMA_NOT_NULL virtualBlock, - char* VMA_NULLABLE pStatsString) -{ - if(pStatsString != VMA_NULL) - { - VMA_ASSERT(virtualBlock != 
VK_NULL_HANDLE);
-        VMA_DEBUG_GLOBAL_MUTEX_LOCK;
-        VmaFreeString(virtualBlock->GetAllocationCallbacks(), pStatsString);
-    }
-}
-#endif // VMA_STATS_STRING_ENABLED
-#endif // _VMA_PUBLIC_INTERFACE
-#endif // VMA_IMPLEMENTATION
-
-/**
-\page quick_start Quick start
-
-\section quick_start_project_setup Project setup
-
-Vulkan Memory Allocator comes in the form of a "stb-style" single header file.
-You don't need to build it as a separate library project.
-You can add this file directly to your project and submit it to your code repository next to your other source files.
-
-"Single header" doesn't mean that everything is contained in C/C++ declarations,
-like it tends to be in the case of inline functions or C++ templates.
-It means that the implementation is bundled with the interface in a single file and needs to be extracted using a preprocessor macro.
-If you don't do it properly, you will get linker errors.
-
-To do it properly:
-
--# Include the "vk_mem_alloc.h" file in each CPP file where you want to use the library.
-   This includes declarations of all members of the library.
--# In exactly one CPP file define the following macro before this include.
-   It also enables internal definitions.
-
-\code
-#define VMA_IMPLEMENTATION
-#include "vk_mem_alloc.h"
-\endcode
-
-It may be a good idea to create a dedicated CPP file just for this purpose.
-
-This library includes header `<vulkan/vulkan.h>`, which in turn
-includes `<windows.h>` on Windows. If you need some specific macros defined
-before including these headers (like `WIN32_LEAN_AND_MEAN` or
-`WINVER` for Windows, `VK_USE_PLATFORM_WIN32_KHR` for Vulkan), you must define
-them before every `#include` of this library.
-
-This library is written in C++, but has a C-compatible interface.
-Thus you can include and use vk_mem_alloc.h in C or C++ code, but the full
-implementation with the `VMA_IMPLEMENTATION` macro must be compiled as C++, NOT as C.
-Some features of C++14 are used. STL containers, RTTI, and C++ exceptions are not used.
-
-
-\section quick_start_initialization Initialization
-
-At program startup:
-
--# Initialize Vulkan to have `VkPhysicalDevice`, `VkDevice` and `VkInstance` objects.
--# Fill the VmaAllocatorCreateInfo structure and create a #VmaAllocator object by
-   calling vmaCreateAllocator().
-
-Only the members `physicalDevice`, `device`, `instance` are required.
-However, you should inform the library which Vulkan version you use by setting
-VmaAllocatorCreateInfo::vulkanApiVersion and which extensions you enabled
-by setting VmaAllocatorCreateInfo::flags (like #VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT for VK_KHR_buffer_device_address).
-Otherwise, VMA would use only the features of Vulkan 1.0 core with no extensions.
-
-You may need to configure how Vulkan functions are imported. There are 3 ways to do this:
-
--# **If you link with the Vulkan static library** (e.g. "vulkan-1.lib" on Windows):
-   - You don't need to do anything.
-   - VMA will use these, as the macro `VMA_STATIC_VULKAN_FUNCTIONS` is defined to 1 by default.
--# **If you want VMA to fetch pointers to Vulkan functions dynamically** using `vkGetInstanceProcAddr`,
-   `vkGetDeviceProcAddr` (this is the option presented in the example below):
-   - Define `VMA_STATIC_VULKAN_FUNCTIONS` to 0, `VMA_DYNAMIC_VULKAN_FUNCTIONS` to 1.
-   - Provide pointers to these two functions via VmaVulkanFunctions::vkGetInstanceProcAddr,
-     VmaVulkanFunctions::vkGetDeviceProcAddr.
-   - The library will fetch pointers to all other functions it needs internally.
--# **If you fetch pointers to all Vulkan functions in a custom way**, e.g. using some loader like
-   [Volk](https://github.com/zeux/volk):
-   - Define `VMA_STATIC_VULKAN_FUNCTIONS` and `VMA_DYNAMIC_VULKAN_FUNCTIONS` to 0.
-   - Pass these pointers via structure #VmaVulkanFunctions.
-
-\code
-VmaVulkanFunctions vulkanFunctions = {};
-vulkanFunctions.vkGetInstanceProcAddr = &vkGetInstanceProcAddr;
-vulkanFunctions.vkGetDeviceProcAddr = &vkGetDeviceProcAddr;
-
-VmaAllocatorCreateInfo allocatorCreateInfo = {};
-allocatorCreateInfo.vulkanApiVersion = VK_API_VERSION_1_2;
-allocatorCreateInfo.physicalDevice = physicalDevice;
-allocatorCreateInfo.device = device;
-allocatorCreateInfo.instance = instance;
-allocatorCreateInfo.pVulkanFunctions = &vulkanFunctions;
-
-VmaAllocator allocator;
-vmaCreateAllocator(&allocatorCreateInfo, &allocator);
-\endcode
-
-
-\section quick_start_resource_allocation Resource allocation
-
-When you want to create a buffer or image:
-
--# Fill the `VkBufferCreateInfo` / `VkImageCreateInfo` structure.
--# Fill the VmaAllocationCreateInfo structure.
--# Call vmaCreateBuffer() / vmaCreateImage() to get `VkBuffer`/`VkImage` with memory
-   already allocated and bound to it, plus a #VmaAllocation object that represents its underlying memory.
-
-\code
-VkBufferCreateInfo bufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-bufferInfo.size = 65536;
-bufferInfo.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;
-
-VmaAllocationCreateInfo allocInfo = {};
-allocInfo.usage = VMA_MEMORY_USAGE_AUTO;
-
-VkBuffer buffer;
-VmaAllocation allocation;
-vmaCreateBuffer(allocator, &bufferInfo, &allocInfo, &buffer, &allocation, nullptr);
-\endcode
-
-Don't forget to destroy your objects when no longer needed:
-
-\code
-vmaDestroyBuffer(allocator, buffer, allocation);
-vmaDestroyAllocator(allocator);
-\endcode
-
-
-\page choosing_memory_type Choosing memory type
-
-Physical devices in Vulkan support various combinations of memory heaps and
-types. Help with choosing the correct and optimal memory type for your specific
-resource is one of the key features of this library. You can use it by filling
-appropriate members of the VmaAllocationCreateInfo structure, as described below.
-You can also combine multiple methods.
-
--# If you just want to find the memory type index that meets your requirements, you
-   can use the functions: vmaFindMemoryTypeIndexForBufferInfo(),
-   vmaFindMemoryTypeIndexForImageInfo(), vmaFindMemoryTypeIndex().
--# If you want to allocate a region of device memory without association with any
-   specific image or buffer, you can use the function vmaAllocateMemory(). Usage of
-   this function is not recommended and usually not needed.
-   The vmaAllocateMemoryPages() function is also provided for creating multiple allocations at once,
-   which may be useful for sparse binding.
--# If you already have a buffer or an image created, want to allocate memory
-   for it, and will bind it yourself, you can use the functions
-   vmaAllocateMemoryForBuffer(), vmaAllocateMemoryForImage().
-   For binding you should use the functions: vmaBindBufferMemory(), vmaBindImageMemory()
-   or their extended versions: vmaBindBufferMemory2(), vmaBindImageMemory2().
--# **This is the easiest and recommended way to use this library:**
-   If you want to create a buffer or an image, allocate memory for it and bind
-   them together, all in one call, you can use the functions vmaCreateBuffer(),
-   vmaCreateImage().
-
-When using 3. or 4., the library internally queries Vulkan for memory types
-supported for that buffer or image (function `vkGetBufferMemoryRequirements()`)
-and uses only one of these types.
-
-If no memory type can be found that meets all the requirements, these functions
-return `VK_ERROR_FEATURE_NOT_PRESENT`.
-
-You can leave the VmaAllocationCreateInfo structure completely filled with zeros.
-It means no requirements are specified for the memory type.
-It is valid, although not very useful.
-
-\section choosing_memory_type_usage Usage
-
-The easiest way to specify memory requirements is to fill the member
-VmaAllocationCreateInfo::usage using one of the values of enum #VmaMemoryUsage.
-It defines high-level, common usage types.
-Since version 3 of the library, it is recommended to use #VMA_MEMORY_USAGE_AUTO to let it select the best memory type for your resource automatically.
-
-For example, if you want to create a uniform buffer that will be filled using
-transfer only once or infrequently and then used for rendering every frame, you can
-do it using the following code. The buffer will most likely end up in a memory type with
-`VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT` so that it is fast to access by the GPU device.
-
-\code
-VkBufferCreateInfo bufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-bufferInfo.size = 65536;
-bufferInfo.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;
-
-VmaAllocationCreateInfo allocInfo = {};
-allocInfo.usage = VMA_MEMORY_USAGE_AUTO;
-
-VkBuffer buffer;
-VmaAllocation allocation;
-vmaCreateBuffer(allocator, &bufferInfo, &allocInfo, &buffer, &allocation, nullptr);
-\endcode
-
-If you have a preference for putting the resource in GPU (device) memory or CPU (host) memory
-on systems with a discrete graphics card where these memories are separate, you can use
-#VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE or #VMA_MEMORY_USAGE_AUTO_PREFER_HOST.
-
-When using `VMA_MEMORY_USAGE_AUTO*` values and you want to map the allocated memory,
-you also need to specify one of the host access flags:
-#VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or #VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT.
-This helps the library choose a preferred memory type that has `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT`,
-so you can map it.
-
-For example, a staging buffer that will be filled via a mapped pointer and then
-used as a source of transfer to the buffer described previously can be created like this.
-It will likely end up in a memory type that is `HOST_VISIBLE` and `HOST_COHERENT`
-but not `HOST_CACHED` (meaning uncached, write-combined) and not `DEVICE_LOCAL` (meaning system RAM).
-
-\code
-VkBufferCreateInfo stagingBufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-stagingBufferInfo.size = 65536;
-stagingBufferInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
-
-VmaAllocationCreateInfo stagingAllocInfo = {};
-stagingAllocInfo.usage = VMA_MEMORY_USAGE_AUTO;
-stagingAllocInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;
-
-VkBuffer stagingBuffer;
-VmaAllocation stagingAllocation;
-vmaCreateBuffer(allocator, &stagingBufferInfo, &stagingAllocInfo, &stagingBuffer, &stagingAllocation, nullptr);
-\endcode
-
-For more examples of creating different kinds of resources, see chapter \ref usage_patterns.
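As an additional sketch (not part of the original text) of the `AUTO_PREFER_*` and host-access flags mentioned above, a hypothetical readback buffer that the GPU fills via transfer and the CPU then reads could be created like this, assuming the same `allocator` as in the previous examples:

\code
// Hypothetical readback buffer: written by the GPU via transfer, then read on the CPU.
VkBufferCreateInfo readbackBufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
readbackBufferInfo.size = 65536;
readbackBufferInfo.usage = VK_BUFFER_USAGE_TRANSFER_DST_BIT;

VmaAllocationCreateInfo readbackAllocInfo = {};
readbackAllocInfo.usage = VMA_MEMORY_USAGE_AUTO_PREFER_HOST;              // prefer CPU (host) memory
readbackAllocInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT;   // CPU will read from it

VkBuffer readbackBuffer;
VmaAllocation readbackAllocation;
vmaCreateBuffer(allocator, &readbackBufferInfo, &readbackAllocInfo,
    &readbackBuffer, &readbackAllocation, nullptr);
\endcode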
-
-Usage values `VMA_MEMORY_USAGE_AUTO*` are legal to use only when the library knows
-about the resource being created by having `VkBufferCreateInfo` / `VkImageCreateInfo` passed,
-so they work with functions like: vmaCreateBuffer(), vmaCreateImage(), vmaFindMemoryTypeIndexForBufferInfo() etc.
-If you allocate raw memory using the function vmaAllocateMemory(), you have to use other means of selecting
-a memory type, as described below.
-
-\note
-Old usage values (`VMA_MEMORY_USAGE_GPU_ONLY`, `VMA_MEMORY_USAGE_CPU_ONLY`,
-`VMA_MEMORY_USAGE_CPU_TO_GPU`, `VMA_MEMORY_USAGE_GPU_TO_CPU`, `VMA_MEMORY_USAGE_CPU_COPY`)
-are still available and work the same way as in previous versions of the library
-for backward compatibility, but they are not recommended.
-
-\section choosing_memory_type_required_preferred_flags Required and preferred flags
-
-You can specify more detailed requirements by filling the members
-VmaAllocationCreateInfo::requiredFlags and VmaAllocationCreateInfo::preferredFlags
-with a combination of bits from enum `VkMemoryPropertyFlags`. For example,
-if you want to create a buffer that will be persistently mapped on the host (so it
-must be `HOST_VISIBLE`) and preferably will also be `HOST_COHERENT` and `HOST_CACHED`,
-use the following code:
-
-\code
-VmaAllocationCreateInfo allocInfo = {};
-allocInfo.requiredFlags = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;
-allocInfo.preferredFlags = VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT;
-allocInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT | VMA_ALLOCATION_CREATE_MAPPED_BIT;
-
-VkBuffer buffer;
-VmaAllocation allocation;
-vmaCreateBuffer(allocator, &bufferInfo, &allocInfo, &buffer, &allocation, nullptr);
-\endcode
-
-A memory type is chosen that has all the required flags and as many preferred
-flags set as possible.
-
-The value passed in VmaAllocationCreateInfo::usage is internally converted to a set of required and preferred flags,
-plus some extra "magic" (heuristics).
-
-\section choosing_memory_type_explicit_memory_types Explicit memory types
-
-If you inspected the memory types available on the physical device and you have
-a preference for the memory types that you want to use, you can fill the member
-VmaAllocationCreateInfo::memoryTypeBits. It is a bit mask, where each bit set
-means that a memory type with that index is allowed to be used for the
-allocation. The special value 0, just like `UINT32_MAX`, means there are no
-restrictions on the memory type index.
-
-Please note that this member is NOT just a memory type index.
-Still, you can use it to choose just one specific memory type.
-For example, if you already determined that your buffer should be created in
-memory type 2, use the following code:
-
-\code
-uint32_t memoryTypeIndex = 2;
-
-VmaAllocationCreateInfo allocInfo = {};
-allocInfo.memoryTypeBits = 1u << memoryTypeIndex;
-
-VkBuffer buffer;
-VmaAllocation allocation;
-vmaCreateBuffer(allocator, &bufferInfo, &allocInfo, &buffer, &allocation, nullptr);
-\endcode
-
-
-\section choosing_memory_type_custom_memory_pools Custom memory pools
-
-If you allocate from a custom memory pool, none of the ways of specifying memory
-requirements described above are applicable, and the aforementioned members
-of the VmaAllocationCreateInfo structure are ignored. The memory type is selected
-explicitly when creating the pool and is then used for all the allocations made from
-that pool. For further details, see \ref custom_memory_pools.
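As a rough sketch of that idea (the authoritative description is in \ref custom_memory_pools), creating a pool for an explicitly chosen memory type and then allocating from it could look like this; `memoryTypeIndex` is a hypothetical variable assumed to have been selected beforehand, e.g. with vmaFindMemoryTypeIndexForBufferInfo(), and `bufferInfo` is like the one from the earlier examples:

\code
// Sketch: create a pool tied to one explicitly chosen memory type.
VmaPoolCreateInfo poolCreateInfo = {};
poolCreateInfo.memoryTypeIndex = memoryTypeIndex; // chosen beforehand (hypothetical variable)

VmaPool pool;
vmaCreatePool(allocator, &poolCreateInfo, &pool);

// Allocations taken from the pool ignore usage/requiredFlags/preferredFlags/memoryTypeBits.
VmaAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.pool = pool;

VkBuffer buffer;
VmaAllocation allocation;
vmaCreateBuffer(allocator, &bufferInfo, &allocCreateInfo, &buffer, &allocation, nullptr);

// Destroy the pool with vmaDestroyPool() after all of its allocations have been freed.
\endcode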
-
-\section choosing_memory_type_dedicated_allocations Dedicated allocations
-
-Memory for allocations is reserved out of a larger block of `VkDeviceMemory`
-allocated from Vulkan internally. That is the main feature of this whole library.
-You can still request a separate memory block to be created for an allocation,
-just like you would do in a trivial solution without using any allocator.
-In that case, a buffer or image is always bound to that memory at offset 0.
-This is called a "dedicated allocation".
-You can explicitly request it by using the flag #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT.
-The library can also internally decide to use a dedicated allocation in some cases, e.g.:
-
-- When the size of the allocation is large.
-- When the [VK_KHR_dedicated_allocation](@ref vk_khr_dedicated_allocation) extension is enabled
-  and it reports that a dedicated allocation is required or recommended for the resource.
-- When allocation of the next big memory block fails due to insufficient device memory,
-  but an allocation with the exact requested size succeeds.
-
-
-\page memory_mapping Memory mapping
-
-To "map memory" in Vulkan means to obtain a CPU pointer to `VkDeviceMemory`,
-to be able to read from it or write to it in CPU code.
-Mapping is possible only for memory allocated from a memory type that has
-the `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT` flag.
-The functions `vkMapMemory()`, `vkUnmapMemory()` are designed for this purpose.
-You can use them directly with memory allocated by this library,
-but it is not recommended because of the following issue:
-mapping the same `VkDeviceMemory` block multiple times is illegal; only one mapping at a time is allowed.
-This includes mapping disjoint regions. Mapping is not reference-counted internally by Vulkan.
-Because of this, Vulkan Memory Allocator provides the following facilities:
-
-\note If you want to be able to map an allocation, you need to specify one of the flags
-#VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or #VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT
-in VmaAllocationCreateInfo::flags. These flags are required for an allocation to be mappable
-when using #VMA_MEMORY_USAGE_AUTO or other `VMA_MEMORY_USAGE_AUTO*` enum values.
-For other usage values they are ignored and every such allocation made in a `HOST_VISIBLE` memory type is mappable,
-but they can still be used for consistency.
-
-\section memory_mapping_mapping_functions Mapping functions
-
-The library provides the following functions for mapping a specific #VmaAllocation: vmaMapMemory(), vmaUnmapMemory().
-They are safer and more convenient to use than the standard Vulkan functions.
-You can map an allocation multiple times simultaneously; mapping is reference-counted internally.
-You can also map different allocations simultaneously regardless of whether they use the same `VkDeviceMemory` block.
-The way it is implemented is that the library always maps the entire memory block, not just the region of the allocation.
-For further details, see the description of the vmaMapMemory() function.
-Example:
-
-\code
-// Having these objects initialized:
-struct ConstantBuffer
-{
-    ...
-};
-ConstantBuffer constantBufferData = ...
-
-VmaAllocator allocator = ...
-VkBuffer constantBuffer = ...
-VmaAllocation constantBufferAllocation = ...
-
-// You can map and fill your buffer using the following code:
-
-void* mappedData;
-vmaMapMemory(allocator, constantBufferAllocation, &mappedData);
-memcpy(mappedData, &constantBufferData, sizeof(constantBufferData));
-vmaUnmapMemory(allocator, constantBufferAllocation);
-\endcode
-
-When mapping, you may see a warning from the Vulkan validation layer similar to this one:
-
-Mapping an image with layout VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL can result in undefined behavior if this memory is used by the device. Only GENERAL or PREINITIALIZED should be used.
-
-It happens because the library maps the entire `VkDeviceMemory` block, where different
-types of images and buffers may end up together, especially on GPUs with unified memory like Intel.
-You can safely ignore it if you are sure you access only memory of the intended
-object that you wanted to map.
-
-
-\section memory_mapping_persistently_mapped_memory Persistently mapped memory
-
-Keeping your memory persistently mapped is generally OK in Vulkan.
-You don't need to unmap it before using its data on the GPU.
-The library provides a special feature designed for that:
-Allocations made with #VMA_ALLOCATION_CREATE_MAPPED_BIT flag set in
-VmaAllocationCreateInfo::flags stay mapped all the time,
-so you can just access the CPU pointer to them at any time
-without a need to call any "map" or "unmap" function.
-Example:
-
-\code
-VkBufferCreateInfo bufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-bufCreateInfo.size = sizeof(ConstantBuffer);
-bufCreateInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
-
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-allocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
-    VMA_ALLOCATION_CREATE_MAPPED_BIT;
-
-VkBuffer buf;
-VmaAllocation alloc;
-VmaAllocationInfo allocInfo;
-vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buf, &alloc, &allocInfo);
-
-// Buffer is already mapped. You can access its memory.
-memcpy(allocInfo.pMappedData, &constantBufferData, sizeof(constantBufferData));
-\endcode
-
-\note #VMA_ALLOCATION_CREATE_MAPPED_BIT by itself doesn't guarantee that the allocation will end up
-in a mappable memory type.
-For this, you need to also specify #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT or
-#VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT.
-#VMA_ALLOCATION_CREATE_MAPPED_BIT only guarantees that if the memory is `HOST_VISIBLE`, the allocation will be mapped on creation.
-For an example of how to make use of this fact, see section \ref usage_patterns_advanced_data_uploading.
-
-\section memory_mapping_cache_control Cache flush and invalidate
-
-Memory in Vulkan doesn't need to be unmapped before using it on the GPU,
-but unless a memory type has the `VK_MEMORY_PROPERTY_HOST_COHERENT_BIT` flag set,
-you need to manually **invalidate** the cache before reading from a mapped pointer
-and **flush** the cache after writing to a mapped pointer.
-Map/unmap operations don't do that automatically.
-Vulkan provides the following functions for this purpose: `vkFlushMappedMemoryRanges()`,
-`vkInvalidateMappedMemoryRanges()`, but this library provides more convenient
-functions that refer to a given allocation object: vmaFlushAllocation(),
-vmaInvalidateAllocation(),
-or multiple objects at once: vmaFlushAllocations(), vmaInvalidateAllocations().
-
-Regions of memory specified for flush/invalidate must be aligned to
-`VkPhysicalDeviceLimits::nonCoherentAtomSize`. This is automatically ensured by the library.
-In any memory type that is `HOST_VISIBLE` but not `HOST_COHERENT`, all allocations
-within blocks are aligned to this value, so their offsets are always multiples of
-`nonCoherentAtomSize` and two different allocations never share the same "line" of this size.
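-
-As a minimal sketch - assuming `alloc` is an allocation created with #VMA_ALLOCATION_CREATE_MAPPED_BIT
-as in the example above, and `myData` / `myDataSize` are hypothetical - writing through the mapped pointer
-and then flushing the whole allocation could look like this:
-
-\code
-// Write to the persistently mapped pointer first...
-memcpy(allocInfo.pMappedData, myData, myDataSize);
-// ...then flush the written range (offset 0, whole allocation) so the writes become
-// visible to the GPU even on memory types that are not HOST_COHERENT.
-vmaFlushAllocation(allocator, alloc, 0, VK_WHOLE_SIZE);
-\endcode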
-
-Also, Windows drivers from all 3 PC GPU vendors (AMD, Intel, NVIDIA)
-currently provide the `HOST_COHERENT` flag on all memory types that are
-`HOST_VISIBLE`, so on PC you may not need to bother.
-
-
-\page staying_within_budget Staying within budget
-
-When developing a graphics-intensive game or program, it is important to avoid allocating
-more GPU memory than is physically available. When the memory is over-committed,
-various bad things can happen, depending on the specific GPU, graphics driver, and
-operating system:
-
-- It may just work without any problems.
-- The application may slow down because some memory blocks are moved to system RAM
-  and the GPU has to access them through the PCI Express bus.
-- A new allocation may take a very long time to complete, even a few seconds, and possibly
-  freeze the entire system.
-- The new allocation may fail with `VK_ERROR_OUT_OF_DEVICE_MEMORY`.
-- It may even result in a GPU crash (TDR), observed as `VK_ERROR_DEVICE_LOST`
-  returned somewhere later.
-
-\section staying_within_budget_querying_for_budget Querying for budget
-
-To query for current memory usage and available budget, use function vmaGetHeapBudgets().
-The returned structure #VmaBudget contains quantities expressed in bytes, per Vulkan memory heap.
-
-Please note that this function returns different information and works faster than
-vmaCalculateStatistics(). vmaGetHeapBudgets() can be called every frame or even before every
-allocation, while vmaCalculateStatistics() is intended to be used rarely,
-only to obtain statistical information, e.g. for debugging purposes.
-
-It is recommended to use the VK_EXT_memory_budget device extension to obtain information
-about the budget from the Vulkan device. VMA is able to use this extension automatically.
-When not enabled, the allocator behaves the same way, but then it estimates current usage
-and available budget based on its internal information and Vulkan memory heap sizes,
-which may be less precise. In order to use this extension:
-
-1. Make sure extensions VK_EXT_memory_budget and VK_KHR_get_physical_device_properties2
-   required by it are available and enable them. Please note that the first is a device
-   extension and the second is an instance extension!
-2. Use flag #VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT when creating #VmaAllocator object.
-3. Make sure to call vmaSetCurrentFrameIndex() every frame. Budget is queried from
-   Vulkan inside of it to avoid overhead of querying it with every allocation.
-
-\section staying_within_budget_controlling_memory_usage Controlling memory usage
-
-There are many ways in which you can try to stay within the budget.
-
-First, when making a new allocation requires allocating a new memory block, the library
-tries not to exceed the budget automatically. If a block with the default recommended size
-(e.g. 256 MB) would go over budget, a smaller block is allocated, possibly even
-dedicated memory for just this resource.
-
-If the size of the requested resource plus current memory usage is more than the
-budget, by default the library still tries to create it, leaving it to the Vulkan
-implementation whether the allocation succeeds or fails. You can change this behavior
-by using the #VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT flag. With it, the allocation is
-not made if it would exceed the budget or if the budget is already exceeded.
-VMA then tries to make the allocation from the next eligible Vulkan memory type.
-If all of them fail, the call fails with `VK_ERROR_OUT_OF_DEVICE_MEMORY`.
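-
-As a small sketch - assuming `bufCreateInfo` is a hypothetical, already filled `VkBufferCreateInfo` -
-requesting that an allocation stay within budget only takes this one extra flag:
-
-\code
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-// Fail instead of going over budget.
-allocCreateInfo.flags = VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT;
-
-VkBuffer buf;
-VmaAllocation alloc;
-VkResult res = vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buf, &alloc, nullptr);
-// res == VK_ERROR_OUT_OF_DEVICE_MEMORY if the allocation would not fit within the budget.
-\endcode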
-
-An example usage pattern may be to pass the #VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT flag
-when creating resources that are not essential for the application (e.g. the texture
-of a specific object) and not to pass it when creating critically important resources
-(e.g. render targets).
-
-On AMD graphics cards there is a custom vendor extension available: VK_AMD_memory_overallocation_behavior
-that allows controlling the behavior of the Vulkan implementation in out-of-memory cases -
-whether it should fail with an error code or still allow the allocation.
-Usage of this extension involves only passing an extra structure on Vulkan device creation,
-so it is out of scope of this library.
-
-Finally, you can also use the #VMA_ALLOCATION_CREATE_NEVER_ALLOCATE_BIT flag to make sure
-a new allocation is created only when it fits inside one of the existing memory blocks.
-If it would require allocating a new block, it fails instead with `VK_ERROR_OUT_OF_DEVICE_MEMORY`.
-This also ensures that the function call is very fast because it never goes to Vulkan
-to obtain a new block.
-
-\note Creating \ref custom_memory_pools with VmaPoolCreateInfo::minBlockCount
-set to more than 0 will currently try to allocate memory blocks without checking whether they
-fit within budget.
-
-
-\page resource_aliasing Resource aliasing (overlap)
-
-New explicit graphics APIs (Vulkan and Direct3D 12), thanks to manual memory
-management, give an opportunity to alias (overlap) multiple resources in the
-same region of memory - a feature not available in the old APIs (Direct3D 11, OpenGL).
-It can be useful to save video memory, but it must be used with caution.
-
-For example, if you know the flow of your whole render frame in advance, you
-are going to use some intermediate textures or buffers only during a small range of render passes,
-and you know these ranges don't overlap in time, you can bind these resources to
-the same place in memory, even if they have completely different parameters (width, height, format etc.).
-
-![Resource aliasing (overlap)](../gfx/Aliasing.png)
-
-Such a scenario is possible using VMA, but you need to create your images manually.
-Then you need to calculate parameters of an allocation to be made using the formula:
-
-- allocation size = max(size of each image)
-- allocation alignment = max(alignment of each image)
-- allocation memoryTypeBits = bitwise AND(memoryTypeBits of each image)
-
-The following example shows two different images bound to the same place in memory,
-allocated to fit the largest of them.
-
-\code
-// A 512x512 texture to be sampled.
-VkImageCreateInfo img1CreateInfo = { VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO };
-img1CreateInfo.imageType = VK_IMAGE_TYPE_2D;
-img1CreateInfo.extent.width = 512;
-img1CreateInfo.extent.height = 512;
-img1CreateInfo.extent.depth = 1;
-img1CreateInfo.mipLevels = 10;
-img1CreateInfo.arrayLayers = 1;
-img1CreateInfo.format = VK_FORMAT_R8G8B8A8_SRGB;
-img1CreateInfo.tiling = VK_IMAGE_TILING_OPTIMAL;
-img1CreateInfo.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
-img1CreateInfo.usage = VK_IMAGE_USAGE_TRANSFER_DST_BIT | VK_IMAGE_USAGE_SAMPLED_BIT;
-img1CreateInfo.samples = VK_SAMPLE_COUNT_1_BIT;
-
-// A full screen texture to be used as color attachment.
-VkImageCreateInfo img2CreateInfo = { VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO };
-img2CreateInfo.imageType = VK_IMAGE_TYPE_2D;
-img2CreateInfo.extent.width = 1920;
-img2CreateInfo.extent.height = 1080;
-img2CreateInfo.extent.depth = 1;
-img2CreateInfo.mipLevels = 1;
-img2CreateInfo.arrayLayers = 1;
-img2CreateInfo.format = VK_FORMAT_R8G8B8A8_UNORM;
-img2CreateInfo.tiling = VK_IMAGE_TILING_OPTIMAL;
-img2CreateInfo.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
-img2CreateInfo.usage = VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT;
-img2CreateInfo.samples = VK_SAMPLE_COUNT_1_BIT;
-
-VkImage img1;
-res = vkCreateImage(device, &img1CreateInfo, nullptr, &img1);
-VkImage img2;
-res = vkCreateImage(device, &img2CreateInfo, nullptr, &img2);
-
-VkMemoryRequirements img1MemReq;
-vkGetImageMemoryRequirements(device, img1, &img1MemReq);
-VkMemoryRequirements img2MemReq;
-vkGetImageMemoryRequirements(device, img2, &img2MemReq);
-
-VkMemoryRequirements finalMemReq = {};
-finalMemReq.size = std::max(img1MemReq.size, img2MemReq.size);
-finalMemReq.alignment = std::max(img1MemReq.alignment, img2MemReq.alignment);
-finalMemReq.memoryTypeBits = img1MemReq.memoryTypeBits & img2MemReq.memoryTypeBits;
-// Validate: if(finalMemReq.memoryTypeBits != 0)
-
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.preferredFlags = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT;
-
-VmaAllocation alloc;
-res = vmaAllocateMemory(allocator, &finalMemReq, &allocCreateInfo, &alloc, nullptr);
-
-res = vmaBindImageMemory(allocator, alloc, img1);
-res = vmaBindImageMemory(allocator, alloc, img2);
-
-// You can use img1, img2 here, but not at the same time!
-
-vmaFreeMemory(allocator, alloc);
-vkDestroyImage(device, img2, nullptr);
-vkDestroyImage(device, img1, nullptr);
-\endcode
-
-Remember that using resources that alias in memory requires proper synchronization.
-You need to issue a memory barrier to make sure commands that use `img1` and `img2`
-don't overlap on the GPU timeline.
-You also need to treat a resource after aliasing as uninitialized - containing garbage data.
-For example, if you use `img1` and then want to use `img2`, you need to issue
-an image memory barrier for `img2` with `oldLayout` = `VK_IMAGE_LAYOUT_UNDEFINED`.
-
-Additional considerations:
-
-- Vulkan also allows interpreting contents of memory between aliasing resources consistently in some cases.
-See chapter 11.8. "Memory Aliasing" of the Vulkan specification or the `VK_IMAGE_CREATE_ALIAS_BIT` flag.
-- You can create a more complex layout where different images and buffers are bound
-at different offsets inside one large allocation. For example, one can imagine
-a big texture used in some render passes, aliasing with a set of many small buffers
-used in some further passes. To bind a resource at a non-zero offset in an allocation,
-use vmaBindBufferMemory2() / vmaBindImageMemory2().
-- Before allocating memory for the resources you want to alias, check `memoryTypeBits`
-returned in memory requirements of each resource to make sure the bits overlap.
-Some GPUs may expose multiple memory types suitable e.g. only for buffers or
-images with `COLOR_ATTACHMENT` usage, so the sets of memory types supported by your
-resources may be disjoint. Aliasing them is not possible in that case.
-
-
-\page custom_memory_pools Custom memory pools
-
-A memory pool contains a number of `VkDeviceMemory` blocks.
-The library automatically creates and manages a default pool for each memory type available on the device.
-The default memory pool automatically grows in size.
-Size of allocated blocks is also variable and managed automatically. - -You can create custom pool and allocate memory out of it. -It can be useful if you want to: - -- Keep certain kind of allocations separate from others. -- Enforce particular, fixed size of Vulkan memory blocks. -- Limit maximum amount of Vulkan memory allocated for that pool. -- Reserve minimum or fixed amount of Vulkan memory always preallocated for that pool. -- Use extra parameters for a set of your allocations that are available in #VmaPoolCreateInfo but not in - #VmaAllocationCreateInfo - e.g., custom minimum alignment, custom `pNext` chain. -- Perform defragmentation on a specific subset of your allocations. - -To use custom memory pools: - --# Fill VmaPoolCreateInfo structure. --# Call vmaCreatePool() to obtain #VmaPool handle. --# When making an allocation, set VmaAllocationCreateInfo::pool to this handle. - You don't need to specify any other parameters of this structure, like `usage`. - -Example: - -\code -// Find memoryTypeIndex for the pool. -VkBufferCreateInfo sampleBufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO }; -sampleBufCreateInfo.size = 0x10000; // Doesn't matter. -sampleBufCreateInfo.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT; - -VmaAllocationCreateInfo sampleAllocCreateInfo = {}; -sampleAllocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO; - -uint32_t memTypeIndex; -VkResult res = vmaFindMemoryTypeIndexForBufferInfo(allocator, - &sampleBufCreateInfo, &sampleAllocCreateInfo, &memTypeIndex); -// Check res... - -// Create a pool that can have at most 2 blocks, 128 MiB each. -VmaPoolCreateInfo poolCreateInfo = {}; -poolCreateInfo.memoryTypeIndex = memTypeIndex; -poolCreateInfo.blockSize = 128ull * 1024 * 1024; -poolCreateInfo.maxBlockCount = 2; - -VmaPool pool; -res = vmaCreatePool(allocator, &poolCreateInfo, &pool); -// Check res... - -// Allocate a buffer out of it. -VkBufferCreateInfo bufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO }; -bufCreateInfo.size = 1024; -bufCreateInfo.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT; - -VmaAllocationCreateInfo allocCreateInfo = {}; -allocCreateInfo.pool = pool; - -VkBuffer buf; -VmaAllocation alloc; -res = vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buf, &alloc, nullptr); -// Check res... -\endcode - -You have to free all allocations made from this pool before destroying it. - -\code -vmaDestroyBuffer(allocator, buf, alloc); -vmaDestroyPool(allocator, pool); -\endcode - -New versions of this library support creating dedicated allocations in custom pools. -It is supported only when VmaPoolCreateInfo::blockSize = 0. -To use this feature, set VmaAllocationCreateInfo::pool to the pointer to your custom pool and -VmaAllocationCreateInfo::flags to #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT. - -\note Excessive use of custom pools is a common mistake when using this library. -Custom pools may be useful for special purposes - when you want to -keep certain type of resources separate e.g. to reserve minimum amount of memory -for them or limit maximum amount of memory they can occupy. For most -resources this is not needed and so it is not recommended to create #VmaPool -objects and allocations out of them. Allocating from the default pool is sufficient. - - -\section custom_memory_pools_MemTypeIndex Choosing memory type index - -When creating a pool, you must explicitly specify memory type index. 
-To find the one suitable for your buffers or images, you can use helper functions -vmaFindMemoryTypeIndexForBufferInfo(), vmaFindMemoryTypeIndexForImageInfo(). -You need to provide structures with example parameters of buffers or images -that you are going to create in that pool. - -\code -VkBufferCreateInfo exampleBufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO }; -exampleBufCreateInfo.size = 1024; // Doesn't matter -exampleBufCreateInfo.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT; - -VmaAllocationCreateInfo allocCreateInfo = {}; -allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO; - -uint32_t memTypeIndex; -vmaFindMemoryTypeIndexForBufferInfo(allocator, &exampleBufCreateInfo, &allocCreateInfo, &memTypeIndex); - -VmaPoolCreateInfo poolCreateInfo = {}; -poolCreateInfo.memoryTypeIndex = memTypeIndex; -// ... -\endcode - -When creating buffers/images allocated in that pool, provide following parameters: - -- `VkBufferCreateInfo`: Prefer to pass same parameters as above. - Otherwise you risk creating resources in a memory type that is not suitable for them, which may result in undefined behavior. - Using different `VK_BUFFER_USAGE_` flags may work, but you shouldn't create images in a pool intended for buffers - or the other way around. -- VmaAllocationCreateInfo: You don't need to pass same parameters. Fill only `pool` member. - Other members are ignored anyway. - -\section linear_algorithm Linear allocation algorithm - -Each Vulkan memory block managed by this library has accompanying metadata that -keeps track of used and unused regions. By default, the metadata structure and -algorithm tries to find best place for new allocations among free regions to -optimize memory usage. This way you can allocate and free objects in any order. - -![Default allocation algorithm](../gfx/Linear_allocator_1_algo_default.png) - -Sometimes there is a need to use simpler, linear allocation algorithm. You can -create custom pool that uses such algorithm by adding flag -#VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT to VmaPoolCreateInfo::flags while creating -#VmaPool object. Then an alternative metadata management is used. It always -creates new allocations after last one and doesn't reuse free regions after -allocations freed in the middle. It results in better allocation performance and -less memory consumed by metadata. - -![Linear allocation algorithm](../gfx/Linear_allocator_2_algo_linear.png) - -With this one flag, you can create a custom pool that can be used in many ways: -free-at-once, stack, double stack, and ring buffer. See below for details. -You don't need to specify explicitly which of these options you are going to use - it is detected automatically. - -\subsection linear_algorithm_free_at_once Free-at-once - -In a pool that uses linear algorithm, you still need to free all the allocations -individually, e.g. by using vmaFreeMemory() or vmaDestroyBuffer(). You can free -them in any order. New allocations are always made after last one - free space -in the middle is not reused. However, when you release all the allocation and -the pool becomes empty, allocation starts from the beginning again. This way you -can use linear algorithm to speed up creation of allocations that you are going -to release all at once. - -![Free-at-once](../gfx/Linear_allocator_3_free_at_once.png) - -This mode is also available for pools created with VmaPoolCreateInfo::maxBlockCount -value that allows multiple memory blocks. 
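-
-For reference, a minimal sketch of this free-at-once usage - assuming `memTypeIndex` was found
-as shown in \ref custom_memory_pools_MemTypeIndex - could look like this:
-
-\code
-VmaPoolCreateInfo poolCreateInfo = {};
-poolCreateInfo.memoryTypeIndex = memTypeIndex;
-poolCreateInfo.flags = VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT;
-
-VmaPool linearPool;
-VkResult res = vmaCreatePool(allocator, &poolCreateInfo, &linearPool);
-// Check res...
-
-// Make allocations with VmaAllocationCreateInfo::pool = linearPool, use them,
-// then free all of them, e.g. with vmaDestroyBuffer() / vmaFreeMemory().
-// Once the pool becomes empty, allocation starts from the beginning of the block again.
-\endcode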
- -\subsection linear_algorithm_stack Stack - -When you free an allocation that was created last, its space can be reused. -Thanks to this, if you always release allocations in the order opposite to their -creation (LIFO - Last In First Out), you can achieve behavior of a stack. - -![Stack](../gfx/Linear_allocator_4_stack.png) - -This mode is also available for pools created with VmaPoolCreateInfo::maxBlockCount -value that allows multiple memory blocks. - -\subsection linear_algorithm_double_stack Double stack - -The space reserved by a custom pool with linear algorithm may be used by two -stacks: - -- First, default one, growing up from offset 0. -- Second, "upper" one, growing down from the end towards lower offsets. - -To make allocation from the upper stack, add flag #VMA_ALLOCATION_CREATE_UPPER_ADDRESS_BIT -to VmaAllocationCreateInfo::flags. - -![Double stack](../gfx/Linear_allocator_7_double_stack.png) - -Double stack is available only in pools with one memory block - -VmaPoolCreateInfo::maxBlockCount must be 1. Otherwise behavior is undefined. - -When the two stacks' ends meet so there is not enough space between them for a -new allocation, such allocation fails with usual -`VK_ERROR_OUT_OF_DEVICE_MEMORY` error. - -\subsection linear_algorithm_ring_buffer Ring buffer - -When you free some allocations from the beginning and there is not enough free space -for a new one at the end of a pool, allocator's "cursor" wraps around to the -beginning and starts allocation there. Thanks to this, if you always release -allocations in the same order as you created them (FIFO - First In First Out), -you can achieve behavior of a ring buffer / queue. - -![Ring buffer](../gfx/Linear_allocator_5_ring_buffer.png) - -Ring buffer is available only in pools with one memory block - -VmaPoolCreateInfo::maxBlockCount must be 1. Otherwise behavior is undefined. - -\note \ref defragmentation is not supported in custom pools created with #VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT. - - -\page defragmentation Defragmentation - -Interleaved allocations and deallocations of many objects of varying size can -cause fragmentation over time, which can lead to a situation where the library is unable -to find a continuous range of free memory for a new allocation despite there is -enough free space, just scattered across many small free ranges between existing -allocations. - -To mitigate this problem, you can use defragmentation feature. -It doesn't happen automatically though and needs your cooperation, -because VMA is a low level library that only allocates memory. -It cannot recreate buffers and images in a new place as it doesn't remember the contents of `VkBufferCreateInfo` / `VkImageCreateInfo` structures. -It cannot copy their contents as it doesn't record any commands to a command buffer. - -Example: - -\code -VmaDefragmentationInfo defragInfo = {}; -defragInfo.pool = myPool; -defragInfo.flags = VMA_DEFRAGMENTATION_FLAG_ALGORITHM_FAST_BIT; - -VmaDefragmentationContext defragCtx; -VkResult res = vmaBeginDefragmentation(allocator, &defragInfo, &defragCtx); -// Check res... - -for(;;) -{ - VmaDefragmentationPassMoveInfo pass; - res = vmaBeginDefragmentationPass(allocator, defragCtx, &pass); - if(res == VK_SUCCESS) - break; - else if(res != VK_INCOMPLETE) - // Handle error... - - for(uint32_t i = 0; i < pass.moveCount; ++i) - { - // Inspect pass.pMoves[i].srcAllocation, identify what buffer/image it represents. 
-        VmaAllocationInfo allocInfo;
-        vmaGetAllocationInfo(allocator, pass.pMoves[i].srcAllocation, &allocInfo);
-        MyEngineResourceData* resData = (MyEngineResourceData*)allocInfo.pUserData;
-
-        // Recreate and bind this buffer/image at: pass.pMoves[i].dstMemory, pass.pMoves[i].dstOffset.
-        VkImageCreateInfo imgCreateInfo = ...
-        VkImage newImg;
-        res = vkCreateImage(device, &imgCreateInfo, nullptr, &newImg);
-        // Check res...
-        res = vmaBindImageMemory(allocator, pass.pMoves[i].dstTmpAllocation, newImg);
-        // Check res...
-
-        // Issue a vkCmdCopyBuffer/vkCmdCopyImage to copy its content to the new place.
-        vkCmdCopyImage(cmdBuf, resData->img, ..., newImg, ...);
-    }
-
-    // Make sure the copy commands finished executing.
-    vkWaitForFences(...);
-
-    // Destroy old buffers/images bound with pass.pMoves[i].srcAllocation.
-    for(uint32_t i = 0; i < pass.moveCount; ++i)
-    {
-        // ...
-        vkDestroyImage(device, resData->img, nullptr);
-    }
-
-    // Update appropriate descriptors to point to the new places...
-
-    res = vmaEndDefragmentationPass(allocator, defragCtx, &pass);
-    if(res == VK_SUCCESS)
-        break;
-    else if(res != VK_INCOMPLETE)
-        // Handle error...
-}
-
-vmaEndDefragmentation(allocator, defragCtx, nullptr);
-\endcode
-
-Although functions like vmaCreateBuffer(), vmaCreateImage(), vmaDestroyBuffer(), vmaDestroyImage()
-create/destroy an allocation and a buffer/image at once, these are just a shortcut for
-creating the resource, allocating memory, and binding them together.
-Defragmentation works on memory allocations only. You must handle the rest manually.
-Defragmentation is an iterative process that should repeat "passes" as long as related functions
-return `VK_INCOMPLETE`, not `VK_SUCCESS`.
-In each pass:
-
-1. vmaBeginDefragmentationPass() function call:
-   - Calculates and returns the list of allocations to be moved in this pass.
-     Note this can be a time-consuming process.
-   - Reserves destination memory for them by creating temporary destination allocations
-     that you can query for their `VkDeviceMemory` + offset using vmaGetAllocationInfo().
-2. Inside the pass, **you should**:
-   - Inspect the returned list of allocations to be moved.
-   - Create new buffers/images and bind them at the returned destination temporary allocations.
-   - Copy data from source to destination resources if necessary.
-   - Destroy the source buffers/images, but NOT their allocations.
-3. vmaEndDefragmentationPass() function call:
-   - Frees the source memory reserved for the allocations that are moved.
-   - Modifies source #VmaAllocation objects that are moved to point to the destination reserved memory.
-   - Frees `VkDeviceMemory` blocks that became empty.
-
-Unlike in previous iterations of the defragmentation API, there is no list of "movable" allocations passed as a parameter.
-The defragmentation algorithm tries to move all suitable allocations.
-You can, however, refuse to move some of them inside a defragmentation pass, by setting
-`pass.pMoves[i].operation` to #VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE.
-This is not recommended and may result in suboptimal packing of the allocations after defragmentation.
-If you cannot ensure any allocation can be moved, it is better to keep movable allocations separate in a custom pool.
-
-Inside a pass, for each allocation that should be moved:
-
-- You should copy its data from the source to the destination place by calling e.g. `vkCmdCopyBuffer()`, `vkCmdCopyImage()`.
-  - You need to make sure these commands finished executing before destroying the source buffers/images and before calling vmaEndDefragmentationPass().
-- If a resource doesn't contain any meaningful data, e.g. it is a transient color attachment image to be cleared,
-  filled, and used temporarily in each rendering frame, you can just recreate this image
-  without copying its data.
-- If the resource is in `HOST_VISIBLE` and `HOST_CACHED` memory, you can copy its data on the CPU
-  using `memcpy()`.
-- If you cannot move the allocation, you can set `pass.pMoves[i].operation` to #VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE.
-  This will cancel the move.
-  - vmaEndDefragmentationPass() will then free the destination memory - not the source memory of the allocation, leaving it unchanged.
-- If you decide the allocation is unimportant and can be destroyed instead of moved (e.g. it wasn't used for a long time),
-  you can set `pass.pMoves[i].operation` to #VMA_DEFRAGMENTATION_MOVE_OPERATION_DESTROY.
-  - vmaEndDefragmentationPass() will then free both source and destination memory, and will destroy the source #VmaAllocation object.
-
-You can defragment a specific custom pool by setting VmaDefragmentationInfo::pool
-(like in the example above) or all the default pools by setting this member to null.
-
-Defragmentation is always performed in each pool separately.
-Allocations are never moved between different Vulkan memory types.
-The size of the destination memory reserved for a moved allocation is the same as the original one.
-The alignment of an allocation as it was determined using `vkGetBufferMemoryRequirements()` etc. is also respected after defragmentation.
-Buffers/images should be recreated with the same `VkBufferCreateInfo` / `VkImageCreateInfo` parameters as the original ones.
-
-You can perform the defragmentation incrementally to limit the number of allocations and bytes to be moved
-in each pass, e.g. to call it in sync with render frames and avoid too big hitches.
-See members: VmaDefragmentationInfo::maxBytesPerPass, VmaDefragmentationInfo::maxAllocationsPerPass.
-
-It is also safe to perform the defragmentation asynchronously to render frames and other Vulkan and VMA
-usage, possibly from multiple threads, with the exception that allocations
-returned in VmaDefragmentationPassMoveInfo::pMoves shouldn't be destroyed until the defragmentation pass is ended.
-
-Mapping is preserved on allocations that are moved during defragmentation.
-Whether mapped through #VMA_ALLOCATION_CREATE_MAPPED_BIT or vmaMapMemory(), the allocations
-are mapped at their new place. Of course, the pointer to the mapped data changes, so it needs to be queried
-using VmaAllocationInfo::pMappedData.
-
-\note Defragmentation is not supported in custom pools created with #VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT.
-
-
-\page statistics Statistics
-
-This library contains several functions that return information about its internal state,
-especially the amount of memory allocated from Vulkan.
-
-\section statistics_numeric_statistics Numeric statistics
-
-If you need to obtain basic statistics about memory usage per heap, together with current budget,
-you can call function vmaGetHeapBudgets() and inspect structure #VmaBudget.
-This is useful to keep track of memory usage and stay within budget
-(see also \ref staying_within_budget).
-Example:
-
-\code
-uint32_t heapIndex = ...
-
-VmaBudget budgets[VK_MAX_MEMORY_HEAPS];
-vmaGetHeapBudgets(allocator, budgets);
-
-printf("My heap currently has %u allocations taking %llu B,\n",
-    budgets[heapIndex].statistics.allocationCount,
-    budgets[heapIndex].statistics.allocationBytes);
-printf("allocated out of %u Vulkan device memory blocks taking %llu B,\n",
-    budgets[heapIndex].statistics.blockCount,
-    budgets[heapIndex].statistics.blockBytes);
-printf("Vulkan reports total usage %llu B with budget %llu B.\n",
-    budgets[heapIndex].usage,
-    budgets[heapIndex].budget);
-\endcode
-
-You can query for more detailed statistics per memory heap, type, and totals,
-including minimum and maximum allocation size and unused range size,
-by calling function vmaCalculateStatistics() and inspecting structure #VmaTotalStatistics.
-This function is slower though, as it has to traverse all the internal data structures,
-so it should be used only for debugging purposes.
-
-You can query for statistics of a custom pool using function vmaGetPoolStatistics()
-or vmaCalculatePoolStatistics().
-
-You can query for information about a specific allocation using function vmaGetAllocationInfo().
-It fills the structure #VmaAllocationInfo.
-
-\section statistics_json_dump JSON dump
-
-You can dump the internal state of the allocator to a string in JSON format using function vmaBuildStatsString().
-The result is guaranteed to be correct JSON.
-It uses ANSI encoding.
-Any strings provided by the user (see [Allocation names](@ref allocation_names))
-are copied as-is and properly escaped for JSON, so if they use UTF-8, ISO-8859-2 or any other encoding,
-this JSON string can be treated as using this encoding.
-It must be freed using function vmaFreeStatsString().
-
-The format of this JSON string is not part of the official documentation of the library,
-but it will not change in a backward-incompatible way without increasing the library's major version number
-and an appropriate mention in the changelog.
-
-The JSON string contains all the data that can be obtained using vmaCalculateStatistics().
-It can also contain a detailed map of allocated memory blocks and their regions -
-free and occupied by allocations.
-This allows e.g. visualizing the memory or assessing fragmentation.
-
-
-\page allocation_annotation Allocation names and user data
-
-\section allocation_user_data Allocation user data
-
-You can annotate allocations with your own information, e.g. for debugging purposes.
-To do that, fill the VmaAllocationCreateInfo::pUserData field when creating
-an allocation. It is an opaque `void*` pointer. You can use it e.g. as a pointer,
-some handle, index, key, ordinal number or any other value that would associate
-the allocation with your custom metadata.
-It is useful to identify appropriate data structures in your engine given a #VmaAllocation,
-e.g. when doing \ref defragmentation.
-
-\code
-VkBufferCreateInfo bufCreateInfo = ...
-
-MyBufferMetadata* pMetadata = CreateBufferMetadata();
-
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-allocCreateInfo.pUserData = pMetadata;
-
-VkBuffer buffer;
-VmaAllocation allocation;
-vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buffer, &allocation, nullptr);
-\endcode
-
-The pointer may be later retrieved as VmaAllocationInfo::pUserData:
-
-\code
-VmaAllocationInfo allocInfo;
-vmaGetAllocationInfo(allocator, allocation, &allocInfo);
-MyBufferMetadata* pMetadata = (MyBufferMetadata*)allocInfo.pUserData;
-\endcode
-
-It can also be changed using function vmaSetAllocationUserData().
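-
-A minimal sketch of doing that later - reusing `allocation` and the hypothetical metadata factory
-from the example above:
-
-\code
-// Replace the previously set pointer with a different metadata object.
-MyBufferMetadata* pNewMetadata = CreateBufferMetadata();
-vmaSetAllocationUserData(allocator, allocation, pNewMetadata);
-\endcode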
- -Values of (non-zero) allocations' `pUserData` are printed in JSON report created by -vmaBuildStatsString() in hexadecimal form. - -\section allocation_names Allocation names - -An allocation can also carry a null-terminated string, giving a name to the allocation. -To set it, call vmaSetAllocationName(). -The library creates internal copy of the string, so the pointer you pass doesn't need -to be valid for whole lifetime of the allocation. You can free it after the call. - -\code -std::string imageName = "Texture: "; -imageName += fileName; -vmaSetAllocationName(allocator, allocation, imageName.c_str()); -\endcode - -The string can be later retrieved by inspecting VmaAllocationInfo::pName. -It is also printed in JSON report created by vmaBuildStatsString(). - -\note Setting string name to VMA allocation doesn't automatically set it to the Vulkan buffer or image created with it. -You must do it manually using an extension like VK_EXT_debug_utils, which is independent of this library. - - -\page virtual_allocator Virtual allocator - -As an extra feature, the core allocation algorithm of the library is exposed through a simple and convenient API of "virtual allocator". -It doesn't allocate any real GPU memory. It just keeps track of used and free regions of a "virtual block". -You can use it to allocate your own memory or other objects, even completely unrelated to Vulkan. -A common use case is sub-allocation of pieces of one large GPU buffer. - -\section virtual_allocator_creating_virtual_block Creating virtual block - -To use this functionality, there is no main "allocator" object. -You don't need to have #VmaAllocator object created. -All you need to do is to create a separate #VmaVirtualBlock object for each block of memory you want to be managed by the allocator: - --# Fill in #VmaVirtualBlockCreateInfo structure. --# Call vmaCreateVirtualBlock(). Get new #VmaVirtualBlock object. - -Example: - -\code -VmaVirtualBlockCreateInfo blockCreateInfo = {}; -blockCreateInfo.size = 1048576; // 1 MB - -VmaVirtualBlock block; -VkResult res = vmaCreateVirtualBlock(&blockCreateInfo, &block); -\endcode - -\section virtual_allocator_making_virtual_allocations Making virtual allocations - -#VmaVirtualBlock object contains internal data structure that keeps track of free and occupied regions -using the same code as the main Vulkan memory allocator. -Similarly to #VmaAllocation for standard GPU allocations, there is #VmaVirtualAllocation type -that represents an opaque handle to an allocation withing the virtual block. - -In order to make such allocation: - --# Fill in #VmaVirtualAllocationCreateInfo structure. --# Call vmaVirtualAllocate(). Get new #VmaVirtualAllocation object that represents the allocation. - You can also receive `VkDeviceSize offset` that was assigned to the allocation. - -Example: - -\code -VmaVirtualAllocationCreateInfo allocCreateInfo = {}; -allocCreateInfo.size = 4096; // 4 KB - -VmaVirtualAllocation alloc; -VkDeviceSize offset; -res = vmaVirtualAllocate(block, &allocCreateInfo, &alloc, &offset); -if(res == VK_SUCCESS) -{ - // Use the 4 KB of your memory starting at offset. -} -else -{ - // Allocation failed - no space for it could be found. Handle this error! -} -\endcode - -\section virtual_allocator_deallocation Deallocation - -When no longer needed, an allocation can be freed by calling vmaVirtualFree(). -You can only pass to this function an allocation that was previously returned by vmaVirtualAllocate() -called for the same #VmaVirtualBlock. 
-
-When the whole block is no longer needed, the block object can be released by calling vmaDestroyVirtualBlock().
-All allocations must be freed before the block is destroyed, which is checked internally by an assert.
-However, if you don't want to call vmaVirtualFree() for each allocation, you can use vmaClearVirtualBlock() to free them all at once -
-a feature not available in the normal Vulkan memory allocator. Example:
-
-\code
-vmaVirtualFree(block, alloc);
-vmaDestroyVirtualBlock(block);
-\endcode
-
-\section virtual_allocator_allocation_parameters Allocation parameters
-
-You can attach a custom pointer to each allocation by using vmaSetVirtualAllocationUserData().
-Its default value is null.
-It can be used to store any data that needs to be associated with that allocation - e.g. an index, a handle, or a pointer to some
-larger data structure containing more information. Example:
-
-\code
-struct CustomAllocData
-{
-    std::string m_AllocName;
-};
-CustomAllocData* allocData = new CustomAllocData();
-allocData->m_AllocName = "My allocation 1";
-vmaSetVirtualAllocationUserData(block, alloc, allocData);
-\endcode
-
-The pointer can later be fetched, along with allocation offset and size, by passing the allocation handle to function
-vmaGetVirtualAllocationInfo() and inspecting the returned structure #VmaVirtualAllocationInfo.
-If you allocated a new object to be used as the custom pointer, don't forget to delete that object before freeing the allocation!
-Example:
-
-\code
-VmaVirtualAllocationInfo allocInfo;
-vmaGetVirtualAllocationInfo(block, alloc, &allocInfo);
-delete (CustomAllocData*)allocInfo.pUserData;
-
-vmaVirtualFree(block, alloc);
-\endcode
-
-\section virtual_allocator_alignment_and_units Alignment and units
-
-It feels natural to express sizes and offsets in bytes.
-If an offset of an allocation needs to be aligned to a multiple of some number (e.g. 4 bytes), you can fill optional member
-VmaVirtualAllocationCreateInfo::alignment to request it. Example:
-
-\code
-VmaVirtualAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.size = 4096; // 4 KB
-allocCreateInfo.alignment = 4; // Returned offset must be a multiple of 4 B
-
-VmaVirtualAllocation alloc;
-res = vmaVirtualAllocate(block, &allocCreateInfo, &alloc, nullptr);
-\endcode
-
-Alignments of different allocations made from one block may vary.
-However, if all alignments and sizes are always multiples of some size e.g. 4 B or `sizeof(MyDataStruct)`,
-you can express all sizes, alignments, and offsets in multiples of that size instead of individual bytes.
-It might be more convenient, but you need to make sure to use this new unit consistently in all the places:
-
-- VmaVirtualBlockCreateInfo::size
-- VmaVirtualAllocationCreateInfo::size and VmaVirtualAllocationCreateInfo::alignment
-- Using offset returned by vmaVirtualAllocate() or in VmaVirtualAllocationInfo::offset
-
-\section virtual_allocator_statistics Statistics
-
-You can obtain statistics of a virtual block using vmaGetVirtualBlockStatistics()
-(to get brief statistics that are fast to calculate)
-or vmaCalculateVirtualBlockStatistics() (to get more detailed statistics, slower to calculate).
-The functions fill the structures #VmaStatistics and #VmaDetailedStatistics respectively - the same as used by the normal Vulkan memory allocator.
-Example: - -\code -VmaStatistics stats; -vmaGetVirtualBlockStatistics(block, &stats); -printf("My virtual block has %llu bytes used by %u virtual allocations\n", - stats.allocationBytes, stats.allocationCount); -\endcode - -You can also request a full list of allocations and free regions as a string in JSON format by calling -vmaBuildVirtualBlockStatsString(). -Returned string must be later freed using vmaFreeVirtualBlockStatsString(). -The format of this string differs from the one returned by the main Vulkan allocator, but it is similar. - -\section virtual_allocator_additional_considerations Additional considerations - -The "virtual allocator" functionality is implemented on a level of individual memory blocks. -Keeping track of a whole collection of blocks, allocating new ones when out of free space, -deleting empty ones, and deciding which one to try first for a new allocation must be implemented by the user. - -Alternative allocation algorithms are supported, just like in custom pools of the real GPU memory. -See enum #VmaVirtualBlockCreateFlagBits to learn how to specify them (e.g. #VMA_VIRTUAL_BLOCK_CREATE_LINEAR_ALGORITHM_BIT). -You can find their description in chapter \ref custom_memory_pools. -Allocation strategies are also supported. -See enum #VmaVirtualAllocationCreateFlagBits to learn how to specify them (e.g. #VMA_VIRTUAL_ALLOCATION_CREATE_STRATEGY_MIN_TIME_BIT). - -Following features are supported only by the allocator of the real GPU memory and not by virtual allocations: -buffer-image granularity, `VMA_DEBUG_MARGIN`, `VMA_MIN_ALIGNMENT`. - - -\page debugging_memory_usage Debugging incorrect memory usage - -If you suspect a bug with memory usage, like usage of uninitialized memory or -memory being overwritten out of bounds of an allocation, -you can use debug features of this library to verify this. - -\section debugging_memory_usage_initialization Memory initialization - -If you experience a bug with incorrect and nondeterministic data in your program and you suspect uninitialized memory to be used, -you can enable automatic memory initialization to verify this. -To do it, define macro `VMA_DEBUG_INITIALIZE_ALLOCATIONS` to 1. - -\code -#define VMA_DEBUG_INITIALIZE_ALLOCATIONS 1 -#include "vk_mem_alloc.h" -\endcode - -It makes memory of new allocations initialized to bit pattern `0xDCDCDCDC`. -Before an allocation is destroyed, its memory is filled with bit pattern `0xEFEFEFEF`. -Memory is automatically mapped and unmapped if necessary. - -If you find these values while debugging your program, good chances are that you incorrectly -read Vulkan memory that is allocated but not initialized, or already freed, respectively. - -Memory initialization works only with memory types that are `HOST_VISIBLE` and with allocations that can be mapped. -It works also with dedicated allocations. - -\section debugging_memory_usage_margins Margins - -By default, allocations are laid out in memory blocks next to each other if possible -(considering required alignment, `bufferImageGranularity`, and `nonCoherentAtomSize`). - -![Allocations without margin](../gfx/Margins_1.png) - -Define macro `VMA_DEBUG_MARGIN` to some non-zero value (e.g. 16) to enforce specified -number of bytes as a margin after every allocation. - -\code -#define VMA_DEBUG_MARGIN 16 -#include "vk_mem_alloc.h" -\endcode - -![Allocations with margin](../gfx/Margins_2.png) - -If your bug goes away after enabling margins, it means it may be caused by memory -being overwritten outside of allocation boundaries. 
It is not 100% certain, though.
-A change in application behavior may also be caused by a different order and distribution
-of allocations across memory blocks after margins are applied.
-
-Margins work with all types of memory.
-
-The margin is applied only to allocations made out of memory blocks and not to dedicated
-allocations, which have their own memory block of a specific size.
-It is thus not applied to allocations made using the #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT flag
-or those automatically put into dedicated allocations, e.g. due to their
-large size or as recommended by the VK_KHR_dedicated_allocation extension.
-
-Margins appear in the [JSON dump](@ref statistics_json_dump) as part of free space.
-
-Note that enabling margins increases memory usage and fragmentation.
-
-Margins do not apply to \ref virtual_allocator.
-
-\section debugging_memory_usage_corruption_detection Corruption detection
-
-You can additionally define macro `VMA_DEBUG_DETECT_CORRUPTION` to 1 to enable validation
-of contents of the margins.
-
-\code
-#define VMA_DEBUG_MARGIN 16
-#define VMA_DEBUG_DETECT_CORRUPTION 1
-#include "vk_mem_alloc.h"
-\endcode
-
-When this feature is enabled, the number of bytes specified as `VMA_DEBUG_MARGIN`
-(it must be a multiple of 4) after every allocation is filled with a magic number.
-This idea is also known as a "canary".
-Memory is automatically mapped and unmapped if necessary.
-
-This number is validated automatically when the allocation is destroyed.
-If it is not equal to the expected value, `VMA_ASSERT()` is executed.
-It clearly means that either the CPU or the GPU overwrote the memory outside the boundaries of the allocation,
-which indicates a serious bug.
-
-You can also explicitly request checking margins of all allocations in all memory blocks
-that belong to specified memory types by using function vmaCheckCorruption(),
-or in memory blocks that belong to a specified custom pool, by using function
-vmaCheckPoolCorruption().
-
-Margin validation (corruption detection) works only for memory types that are
-`HOST_VISIBLE` and `HOST_COHERENT`.
-
-
-\page opengl_interop OpenGL Interop
-
-VMA provides some features that help with interoperability with OpenGL.
-
-\section opengl_interop_exporting_memory Exporting memory
-
-If you want to attach a `VkExportMemoryAllocateInfoKHR` structure to the `pNext` chain of memory allocations made by the library:
-
-It is recommended to create \ref custom_memory_pools for such allocations.
-Define and fill in your `VkExportMemoryAllocateInfoKHR` structure and attach it to VmaPoolCreateInfo::pMemoryAllocateNext
-while creating the custom pool.
-Please note that the structure must remain alive and unchanged for the whole lifetime of the #VmaPool,
-not only while creating it, as no copy of the structure is made,
-but its original pointer is used for each allocation instead.
-
-If you want to export all memory allocated by the library from certain memory types,
-including dedicated allocations and other allocations made from default pools,
-an alternative solution is to fill in VmaAllocatorCreateInfo::pTypeExternalMemoryHandleTypes.
-It should point to an array with `VkExternalMemoryHandleTypeFlagsKHR` to be automatically passed by the library
-through `VkExportMemoryAllocateInfoKHR` on each allocation made from a specific memory type.
-Please note that new versions of the library also support dedicated allocations created in custom pools.
-
-You should not mix these two methods in a way that allows applying both to the same memory type.
-Otherwise, `VkExportMemoryAllocateInfoKHR` structure would be attached twice to the `pNext` chain of `VkMemoryAllocateInfo`. - - -\section opengl_interop_custom_alignment Custom alignment - -Buffers or images exported to a different API like OpenGL may require a different alignment, -higher than the one used by the library automatically, queried from functions like `vkGetBufferMemoryRequirements`. -To impose such alignment: - -It is recommended to create \ref custom_memory_pools for such allocations. -Set VmaPoolCreateInfo::minAllocationAlignment member to the minimum alignment required for each allocation -to be made out of this pool. -The alignment actually used will be the maximum of this member and the alignment returned for the specific buffer or image -from a function like `vkGetBufferMemoryRequirements`, which is called by VMA automatically. - -If you want to create a buffer with a specific minimum alignment out of default pools, -use special function vmaCreateBufferWithAlignment(), which takes additional parameter `minAlignment`. - -Note the problem of alignment affects only resources placed inside bigger `VkDeviceMemory` blocks and not dedicated -allocations, as these, by definition, always have alignment = 0 because the resource is bound to the beginning of its dedicated block. -Contrary to Direct3D 12, Vulkan doesn't have a concept of alignment of the entire memory block passed on its allocation. - - -\page usage_patterns Recommended usage patterns - -Vulkan gives great flexibility in memory allocation. -This chapter shows the most common patterns. - -See also slides from talk: -[Sawicki, Adam. Advanced Graphics Techniques Tutorial: Memory management in Vulkan and DX12. Game Developers Conference, 2018](https://www.gdcvault.com/play/1025458/Advanced-Graphics-Techniques-Tutorial-New) - - -\section usage_patterns_gpu_only GPU-only resource - -When: -Any resources that you frequently write and read on GPU, -e.g. images used as color attachments (aka "render targets"), depth-stencil attachments, -images/buffers used as storage image/buffer (aka "Unordered Access View (UAV)"). - -What to do: -Let the library select the optimal memory type, which will likely have `VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT`. - -\code -VkImageCreateInfo imgCreateInfo = { VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO }; -imgCreateInfo.imageType = VK_IMAGE_TYPE_2D; -imgCreateInfo.extent.width = 3840; -imgCreateInfo.extent.height = 2160; -imgCreateInfo.extent.depth = 1; -imgCreateInfo.mipLevels = 1; -imgCreateInfo.arrayLayers = 1; -imgCreateInfo.format = VK_FORMAT_R8G8B8A8_UNORM; -imgCreateInfo.tiling = VK_IMAGE_TILING_OPTIMAL; -imgCreateInfo.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED; -imgCreateInfo.usage = VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT; -imgCreateInfo.samples = VK_SAMPLE_COUNT_1_BIT; - -VmaAllocationCreateInfo allocCreateInfo = {}; -allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO; -allocCreateInfo.flags = VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT; -allocCreateInfo.priority = 1.0f; - -VkImage img; -VmaAllocation alloc; -vmaCreateImage(allocator, &imgCreateInfo, &allocCreateInfo, &img, &alloc, nullptr); -\endcode - -Also consider: -Consider creating them as dedicated allocations using #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT, -especially if they are large or if you plan to destroy and recreate them with different sizes -e.g. when display resolution changes. -Prefer to create such resources first and all other GPU resources (like textures and vertex buffers) later. 
-
-When the VK_EXT_memory_priority extension is enabled, it is also worth setting a high priority for such an allocation
-to decrease its chances of being evicted to system memory by the operating system.
-
-\section usage_patterns_staging_copy_upload Staging copy for upload
-
-When:
-A "staging" buffer that you want to map and fill from CPU code, then use as a source of transfer
-to some GPU resource.
-
-What to do:
-Use flag #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT.
-Let the library select the optimal memory type, which will always have `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT`.
-
-\code
-VkBufferCreateInfo bufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-bufCreateInfo.size = 65536;
-bufCreateInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
-
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-allocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
-    VMA_ALLOCATION_CREATE_MAPPED_BIT;
-
-VkBuffer buf;
-VmaAllocation alloc;
-VmaAllocationInfo allocInfo;
-vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buf, &alloc, &allocInfo);
-
-...
-
-memcpy(allocInfo.pMappedData, myData, myDataSize);
-\endcode
-
-Also consider:
-You can map the allocation using vmaMapMemory() or you can create it as persistently mapped
-using #VMA_ALLOCATION_CREATE_MAPPED_BIT, as in the example above.
-
-
-\section usage_patterns_readback Readback
-
-When:
-Buffers for data written by or transferred from the GPU that you want to read back on the CPU,
-e.g. results of some computations.
-
-What to do:
-Use flag #VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT.
-Let the library select the optimal memory type, which will always have `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT`
-and `VK_MEMORY_PROPERTY_HOST_CACHED_BIT`.
-
-\code
-VkBufferCreateInfo bufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-bufCreateInfo.size = 65536;
-bufCreateInfo.usage = VK_BUFFER_USAGE_TRANSFER_DST_BIT;
-
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-allocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT |
-    VMA_ALLOCATION_CREATE_MAPPED_BIT;
-
-VkBuffer buf;
-VmaAllocation alloc;
-VmaAllocationInfo allocInfo;
-vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buf, &alloc, &allocInfo);
-
-...
-
-const float* downloadedData = (const float*)allocInfo.pMappedData;
-\endcode
-
-
-\section usage_patterns_advanced_data_uploading Advanced data uploading
-
-For resources that you frequently write on the CPU via a mapped pointer and
-frequently read on the GPU e.g. as a uniform buffer (also called "dynamic"), multiple options are possible:
-
--# The easiest solution is to have one copy of the resource in `HOST_VISIBLE` memory,
-   even if it means system RAM (not `DEVICE_LOCAL`) on systems with a discrete graphics card,
-   and make the device reach out to that resource directly.
-   - Reads performed by the device will then go through the PCI Express bus.
-     The performance of this access may be limited, but it may be fine depending on the size
-     of this resource (whether it is small enough to quickly end up in GPU cache) and the sparsity
-     of access.
--# On systems with unified memory (e.g. AMD APU or Intel integrated graphics, mobile chips),
-   a memory type may be available that is both `HOST_VISIBLE` (available for mapping) and `DEVICE_LOCAL`
-   (fast to access from the GPU). Then, it is likely the best choice for such a type of resource.
--# Systems with a discrete graphics card and separate video memory may or may not expose
-   a memory type that is both `HOST_VISIBLE` and `DEVICE_LOCAL`, also known as Base Address Register (BAR).
-   If they do, it represents a piece of VRAM (or entire VRAM, if ReBAR is enabled in the motherboard BIOS)
-   that is available to the CPU for mapping.
-   - Writes performed by the host to that memory go through the PCI Express bus.
-     The performance of these writes may be limited, but it may be fine, especially on PCIe 4.0,
-     as long as the rules of using uncached and write-combined memory are followed - only sequential writes and no reads.
--# Finally, you may need or prefer to create a separate copy of the resource in `DEVICE_LOCAL` memory,
-   a separate "staging" copy in `HOST_VISIBLE` memory and perform an explicit transfer command between them.
-
-Thankfully, VMA offers an aid to create and use such resources in the way optimal
-for the current Vulkan device. To help the library make the best choice,
-use flag #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT together with
-#VMA_ALLOCATION_CREATE_HOST_ACCESS_ALLOW_TRANSFER_INSTEAD_BIT.
-It will then prefer a memory type that is both `DEVICE_LOCAL` and `HOST_VISIBLE` (integrated memory or BAR),
-but if no such memory type is available or allocation from it fails
-(PC graphics cards have only 256 MB of BAR by default, unless ReBAR is supported and enabled in BIOS),
-it will fall back to `DEVICE_LOCAL` memory for fast GPU access.
-It is then up to you to detect that the allocation ended up in a memory type that is not `HOST_VISIBLE`,
-so you need to create another "staging" allocation and perform explicit transfers.
-
-\code
-VkBufferCreateInfo bufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-bufCreateInfo.size = 65536;
-bufCreateInfo.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;
-
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-allocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
-    VMA_ALLOCATION_CREATE_HOST_ACCESS_ALLOW_TRANSFER_INSTEAD_BIT |
-    VMA_ALLOCATION_CREATE_MAPPED_BIT;
-
-VkBuffer buf;
-VmaAllocation alloc;
-VmaAllocationInfo allocInfo;
-vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buf, &alloc, &allocInfo);
-
-VkMemoryPropertyFlags memPropFlags;
-vmaGetAllocationMemoryProperties(allocator, alloc, &memPropFlags);
-
-if(memPropFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT)
-{
-    // Allocation ended up in a mappable memory and is already mapped - write to it directly.
-
-    // [Executed in runtime]:
-    memcpy(allocInfo.pMappedData, myData, myDataSize);
-}
-else
-{
-    // Allocation ended up in a non-mappable memory - need to transfer.
-    VkBufferCreateInfo stagingBufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
-    stagingBufCreateInfo.size = 65536;
-    stagingBufCreateInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
-
-    VmaAllocationCreateInfo stagingAllocCreateInfo = {};
-    stagingAllocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-    stagingAllocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
-        VMA_ALLOCATION_CREATE_MAPPED_BIT;
-
-    VkBuffer stagingBuf;
-    VmaAllocation stagingAlloc;
-    VmaAllocationInfo stagingAllocInfo;
-    vmaCreateBuffer(allocator, &stagingBufCreateInfo, &stagingAllocCreateInfo,
-        &stagingBuf, &stagingAlloc, &stagingAllocInfo);
-
-    // [Executed in runtime]:
-    memcpy(stagingAllocInfo.pMappedData, myData, myDataSize);
-    //vkCmdPipelineBarrier: VK_ACCESS_HOST_WRITE_BIT --> VK_ACCESS_TRANSFER_READ_BIT
-    VkBufferCopy bufCopy = {
-        0, // srcOffset
-        0, // dstOffset,
-        myDataSize }; // size
-    vkCmdCopyBuffer(cmdBuf, stagingBuf, buf, 1, &bufCopy);
-}
-\endcode
-
-\section usage_patterns_other_use_cases Other use cases
-
-Here are some other, less obvious use cases and their recommended settings:
-
-- An image that is used only as transfer source and destination, but it should stay on the device,
-  as it is used to temporarily store a copy of some texture, e.g. from the current to the next frame,
-  for temporal antialiasing or other temporal effects.
-  - Use `VkImageCreateInfo::usage = VK_IMAGE_USAGE_TRANSFER_SRC_BIT | VK_IMAGE_USAGE_TRANSFER_DST_BIT`
-  - Use VmaAllocationCreateInfo::usage = #VMA_MEMORY_USAGE_AUTO
-- An image that is used only as transfer source and destination, but it should be placed
-  in the system RAM even though it doesn't need to be mapped, because it serves as a "swap" copy to evict
-  least recently used textures from VRAM.
-  - Use `VkImageCreateInfo::usage = VK_IMAGE_USAGE_TRANSFER_SRC_BIT | VK_IMAGE_USAGE_TRANSFER_DST_BIT`
-  - Use VmaAllocationCreateInfo::usage = #VMA_MEMORY_USAGE_AUTO_PREFER_HOST,
-    as VMA needs a hint here to differentiate from the previous case.
-- A buffer that you want to map and write from the CPU, directly read from the GPU
-  (e.g. as a uniform or vertex buffer), but you have a clear preference to place it in device or
-  host memory due to its large size.
-  - Use `VkBufferCreateInfo::usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT`
-  - Use VmaAllocationCreateInfo::usage = #VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE or #VMA_MEMORY_USAGE_AUTO_PREFER_HOST
-  - Use VmaAllocationCreateInfo::flags = #VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT
-
-
-\page configuration Configuration
-
-Please check "CONFIGURATION SECTION" in the code to find macros that you can define
-before each include of this file or change directly in this file to provide
-your own implementation of basic facilities like assert, `min()` and `max()` functions,
-mutex, atomic etc.
-The library uses its own implementation of containers by default, but you can switch to using
-STL containers instead.
-
-For example, define `VMA_ASSERT(expr)` before including the library to provide
-a custom implementation of the assertion, compatible with your project.
-By default it is defined to standard C `assert(expr)` in `_DEBUG` configuration
-and empty otherwise.
-
-\section config_Vulkan_functions Pointers to Vulkan functions
-
-There are multiple ways to import pointers to Vulkan functions in the library.
-In the simplest case you don't need to do anything.
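-
-As a rough sketch of that simplest case (assuming `instance`, `physicalDevice`, and `device`
-are valid handles created elsewhere and the Vulkan functions are linked statically),
-creating the allocator could look like this:
-
-\code
-VmaAllocatorCreateInfo allocatorCreateInfo = {};
-allocatorCreateInfo.vulkanApiVersion = VK_API_VERSION_1_2;
-allocatorCreateInfo.instance = instance;
-allocatorCreateInfo.physicalDevice = physicalDevice;
-allocatorCreateInfo.device = device;
-// No VmaAllocatorCreateInfo::pVulkanFunctions needed here - the statically
-// linked entry points are picked up automatically.
-
-VmaAllocator allocator;
-vmaCreateAllocator(&allocatorCreateInfo, &allocator);
-\endcode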
-If the compilation or linking of your program or the initialization of the #VmaAllocator
-doesn't work for you, you can try to reconfigure it.
-
-First, the allocator tries to fetch pointers to Vulkan functions linked statically,
-like this:
-
-\code
-m_VulkanFunctions.vkAllocateMemory = (PFN_vkAllocateMemory)vkAllocateMemory;
-\endcode
-
-If you want to disable this feature, set the configuration macro: `#define VMA_STATIC_VULKAN_FUNCTIONS 0`.
-
-Second, you can provide the pointers yourself by setting member VmaAllocatorCreateInfo::pVulkanFunctions.
-You can fetch them e.g. using functions `vkGetInstanceProcAddr` and `vkGetDeviceProcAddr` or
-by using a helper library like [volk](https://github.com/zeux/volk).
-
-Third, VMA tries to fetch remaining pointers that are still null by calling
-`vkGetInstanceProcAddr` and `vkGetDeviceProcAddr` on its own.
-You only need to fill in VmaVulkanFunctions::vkGetInstanceProcAddr and VmaVulkanFunctions::vkGetDeviceProcAddr.
-Other pointers will be fetched automatically.
-If you want to disable this feature, set the configuration macro: `#define VMA_DYNAMIC_VULKAN_FUNCTIONS 0`.
-
-Finally, all the function pointers required by the library (considering the selected
-Vulkan version and enabled extensions) are checked with `VMA_ASSERT` to make sure they are not null.
-
-
-\section custom_memory_allocator Custom host memory allocator
-
-If you use a custom allocator for CPU memory rather than the default C++ operators `new`
-and `delete`, you can make this library use your allocator as well
-by filling the optional member VmaAllocatorCreateInfo::pAllocationCallbacks. These
-functions will be passed to Vulkan, as well as used by the library itself to
-make any CPU-side allocations.
-
-\section allocation_callbacks Device memory allocation callbacks
-
-The library makes calls to `vkAllocateMemory()` and `vkFreeMemory()` internally.
-You can set up callbacks to be informed about these calls, e.g. for the purpose
-of gathering some statistics. To do it, fill the optional member
-VmaAllocatorCreateInfo::pDeviceMemoryCallbacks.
-
-\section heap_memory_limit Device heap memory limit
-
-When device memory of a certain heap runs out of free space, new allocations may
-fail (returning an error code) or they may succeed, silently pushing some existing
-memory blocks from GPU VRAM to system RAM (which degrades performance). This
-behavior is implementation-dependent - it depends on the GPU vendor and graphics
-driver.
-
-On AMD cards it can be controlled while creating the Vulkan device object by using the
-VK_AMD_memory_overallocation_behavior extension, if available.
-
-Alternatively, if you want to test how your program behaves with a limited amount of Vulkan device
-memory available without switching your graphics card to one that really has
-smaller VRAM, you can use a feature of this library intended for this purpose.
-To do it, fill the optional member VmaAllocatorCreateInfo::pHeapSizeLimit.
-
-
-
-\page vk_khr_dedicated_allocation VK_KHR_dedicated_allocation
-
-VK_KHR_dedicated_allocation is a Vulkan extension which can be used to improve
-performance on some GPUs. It augments the Vulkan API with the possibility to query
-the driver whether it prefers a particular buffer or image to have its own, dedicated
-allocation (separate `VkDeviceMemory` block) for better efficiency - to be able
-to do some internal optimizations. The extension is supported by this library.
-It will be used automatically when enabled.
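-
-For illustration, the query added by this extension (shown here in its core Vulkan 1.1 form) looks
-roughly like the following on the Vulkan side. VMA performs an equivalent check internally, so you
-do not need to write it yourself; this sketch assumes `device` and `buf` are an existing `VkDevice`
-and `VkBuffer`:
-
-\code
-VkBufferMemoryRequirementsInfo2 bufReqInfo = { VK_STRUCTURE_TYPE_BUFFER_MEMORY_REQUIREMENTS_INFO_2 };
-bufReqInfo.buffer = buf;
-
-VkMemoryDedicatedRequirements dedicatedReqs = { VK_STRUCTURE_TYPE_MEMORY_DEDICATED_REQUIREMENTS };
-VkMemoryRequirements2 memReqs2 = { VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2 };
-memReqs2.pNext = &dedicatedReqs;
-
-vkGetBufferMemoryRequirements2(device, &bufReqInfo, &memReqs2);
-
-if(dedicatedReqs.prefersDedicatedAllocation || dedicatedReqs.requiresDedicatedAllocation)
-{
-    // The driver prefers (or requires) a separate VkDeviceMemory block for this buffer.
-}
-\endcode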
-
-It has been promoted to core Vulkan 1.1, so if you use an eligible Vulkan version
-and inform VMA about it by setting VmaAllocatorCreateInfo::vulkanApiVersion,
-you are all set.
-
-Otherwise, if you want to use it as an extension:
-
-1. When creating the Vulkan device, check if the following 2 device extensions are
-supported (call `vkEnumerateDeviceExtensionProperties()`).
-If yes, enable them (fill `VkDeviceCreateInfo::ppEnabledExtensionNames`).
-
-- VK_KHR_get_memory_requirements2
-- VK_KHR_dedicated_allocation
-
-If you enabled these extensions:
-
-2. Use the #VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT flag when creating
-your #VmaAllocator to inform the library that you enabled the required extensions
-and you want the library to use them.
-
-\code
-allocatorInfo.flags |= VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT;
-
-vmaCreateAllocator(&allocatorInfo, &allocator);
-\endcode
-
-That is all. The extension will be automatically used whenever you create a
-buffer using vmaCreateBuffer() or an image using vmaCreateImage().
-
-When using the extension together with the Vulkan Validation Layer, you will receive
-warnings like this:
-
-_vkBindBufferMemory(): Binding memory to buffer 0x33 but vkGetBufferMemoryRequirements() has not been called on that buffer._
-
-This is OK - just ignore it. It happens because you use the function
-`vkGetBufferMemoryRequirements2KHR()` instead of the standard
-`vkGetBufferMemoryRequirements()`, while the validation layer seems to be
-unaware of it.
-
-To learn more about this extension, see:
-
-- [VK_KHR_dedicated_allocation in Vulkan specification](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/chap50.html#VK_KHR_dedicated_allocation)
-- [VK_KHR_dedicated_allocation unofficial manual](http://asawicki.info/articles/VK_KHR_dedicated_allocation.php5)
-
-
-
-\page vk_ext_memory_priority VK_EXT_memory_priority
-
-VK_EXT_memory_priority is a device extension that allows passing an additional "priority"
-value to Vulkan memory allocations, which the implementation may use to prefer certain
-buffers and images that are critical for performance to stay in device-local memory
-in cases when the memory is over-subscribed, while some others may be moved to system memory.
-
-VMA offers convenient usage of this extension.
-If you enable it, you can pass a "priority" parameter when creating allocations or custom pools
-and the library automatically passes the value to Vulkan using this extension.
-
-If you want to use this extension in connection with VMA, follow these steps:
-
-\section vk_ext_memory_priority_initialization Initialization
-
-1) Call `vkEnumerateDeviceExtensionProperties` for the physical device.
-Check if the extension is supported - check whether the returned array of `VkExtensionProperties` contains "VK_EXT_memory_priority".
-
-2) Call `vkGetPhysicalDeviceFeatures2` for the physical device instead of the old `vkGetPhysicalDeviceFeatures`.
-Attach the additional structure `VkPhysicalDeviceMemoryPriorityFeaturesEXT` to `VkPhysicalDeviceFeatures2::pNext` to be returned.
-Check if the device feature is really supported - check if `VkPhysicalDeviceMemoryPriorityFeaturesEXT::memoryPriority` is true.
-
-3) While creating the device with `vkCreateDevice`, enable this extension - add "VK_EXT_memory_priority"
-to the list passed as `VkDeviceCreateInfo::ppEnabledExtensionNames`.
-
-4) While creating the device, also don't set `VkDeviceCreateInfo::pEnabledFeatures`.
-Fill in the `VkPhysicalDeviceFeatures2` structure instead and pass it as `VkDeviceCreateInfo::pNext`.
-Enable this device feature - attach the additional structure `VkPhysicalDeviceMemoryPriorityFeaturesEXT` to the
-`VkPhysicalDeviceFeatures2::pNext` chain and set its member `memoryPriority` to `VK_TRUE`.
-
-5) While creating the #VmaAllocator with vmaCreateAllocator(), inform VMA that you
-have enabled this extension and feature - add #VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT
-to VmaAllocatorCreateInfo::flags.
-
-\section vk_ext_memory_priority_usage Usage
-
-When using this extension, you should initialize the following members:
-
-- VmaAllocationCreateInfo::priority when creating a dedicated allocation with #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT.
-- VmaPoolCreateInfo::priority when creating a custom pool.
-
-It should be a floating-point value between `0.0f` and `1.0f`, where the recommended default is `0.5f`.
-Memory allocated with a higher value can be treated by the Vulkan implementation as higher priority
-and so it can have lower chances of being pushed out to system memory, experiencing degraded performance.
-
-It might be a good idea to create performance-critical resources like color-attachment or depth-stencil images
-as dedicated and set a high priority for them. For example:
-
-\code
-VkImageCreateInfo imgCreateInfo = { VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO };
-imgCreateInfo.imageType = VK_IMAGE_TYPE_2D;
-imgCreateInfo.extent.width = 3840;
-imgCreateInfo.extent.height = 2160;
-imgCreateInfo.extent.depth = 1;
-imgCreateInfo.mipLevels = 1;
-imgCreateInfo.arrayLayers = 1;
-imgCreateInfo.format = VK_FORMAT_R8G8B8A8_UNORM;
-imgCreateInfo.tiling = VK_IMAGE_TILING_OPTIMAL;
-imgCreateInfo.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
-imgCreateInfo.usage = VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT;
-imgCreateInfo.samples = VK_SAMPLE_COUNT_1_BIT;
-
-VmaAllocationCreateInfo allocCreateInfo = {};
-allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
-allocCreateInfo.flags = VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT;
-allocCreateInfo.priority = 1.0f;
-
-VkImage img;
-VmaAllocation alloc;
-vmaCreateImage(allocator, &imgCreateInfo, &allocCreateInfo, &img, &alloc, nullptr);
-\endcode
-
-The `priority` member is ignored in the following situations:
-
-- Allocations created in custom pools: They inherit the priority, along with all other allocation parameters,
-  from the parameters passed in #VmaPoolCreateInfo when the pool was created.
-- Allocations created in default pools: They inherit the priority from the parameters
-  VMA used when creating default pools, which means `priority == 0.5f`.
-
-
-\page vk_amd_device_coherent_memory VK_AMD_device_coherent_memory
-
-VK_AMD_device_coherent_memory is a device extension that enables access to
-additional memory types with the `VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD` and
-`VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD` flags. It is useful mostly for
-allocation of buffers intended for writing "breadcrumb markers" in between passes
-or draw calls, which in turn are useful for debugging GPU crash/hang/TDR cases.
-
-When the extension is available but has not been enabled, the Vulkan physical device
-still exposes those memory types, but their usage is forbidden. VMA automatically
-takes care of that - it returns `VK_ERROR_FEATURE_NOT_PRESENT` when an attempt
-to allocate memory of such a type is made.
-
-If you want to use this extension in connection with VMA, follow these steps:
-
-\section vk_amd_device_coherent_memory_initialization Initialization
-
-1) Call `vkEnumerateDeviceExtensionProperties` for the physical device.
-Check if the extension is supported - check whether the returned array of `VkExtensionProperties` contains "VK_AMD_device_coherent_memory".
-
-2) Call `vkGetPhysicalDeviceFeatures2` for the physical device instead of the old `vkGetPhysicalDeviceFeatures`.
-Attach the additional structure `VkPhysicalDeviceCoherentMemoryFeaturesAMD` to `VkPhysicalDeviceFeatures2::pNext` to be returned.
-Check if the device feature is really supported - check if `VkPhysicalDeviceCoherentMemoryFeaturesAMD::deviceCoherentMemory` is true.
-
-3) While creating the device with `vkCreateDevice`, enable this extension - add "VK_AMD_device_coherent_memory"
-to the list passed as `VkDeviceCreateInfo::ppEnabledExtensionNames`.
-
-4) While creating the device, also don't set `VkDeviceCreateInfo::pEnabledFeatures`.
-Fill in the `VkPhysicalDeviceFeatures2` structure instead and pass it as `VkDeviceCreateInfo::pNext`.
-Enable this device feature - attach the additional structure `VkPhysicalDeviceCoherentMemoryFeaturesAMD` to
-`VkPhysicalDeviceFeatures2::pNext` and set its member `deviceCoherentMemory` to `VK_TRUE`.
-
-5) While creating the #VmaAllocator with vmaCreateAllocator(), inform VMA that you
-have enabled this extension and feature - add #VMA_ALLOCATOR_CREATE_AMD_DEVICE_COHERENT_MEMORY_BIT
-to VmaAllocatorCreateInfo::flags.
-
-\section vk_amd_device_coherent_memory_usage Usage
-
-After following the steps described above, you can create VMA allocations and custom pools
-out of the special `DEVICE_COHERENT` and `DEVICE_UNCACHED` memory types on eligible
-devices. There are multiple ways to do it, for example:
-
-- You can request or prefer to allocate out of such memory types by adding
-  `VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD` to VmaAllocationCreateInfo::requiredFlags
-  or VmaAllocationCreateInfo::preferredFlags. Those flags can be freely mixed with
-  other ways of \ref choosing_memory_type, like setting VmaAllocationCreateInfo::usage.
-- If you manually found the memory type index to use for this purpose, force allocation
-  from this specific index by setting VmaAllocationCreateInfo::memoryTypeBits `= 1u << index`.
-
-\section vk_amd_device_coherent_memory_more_information More information
-
-To learn more about this extension, see [VK_AMD_device_coherent_memory in Vulkan specification](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_AMD_device_coherent_memory.html)
-
-Example use of this extension can be found in the code of the sample and test suite
-accompanying this library.
-
-
-\page enabling_buffer_device_address Enabling buffer device address
-
-The device extension VK_KHR_buffer_device_address
-allows fetching a raw GPU pointer to a buffer and passing it for use in shader code.
-It has been promoted to core Vulkan 1.2.
-
-If you want to use this feature in connection with VMA, follow these steps:
-
-\section enabling_buffer_device_address_initialization Initialization
-
-1) (For Vulkan version < 1.2) Call `vkEnumerateDeviceExtensionProperties` for the physical device.
-Check if the extension is supported - check whether the returned array of `VkExtensionProperties` contains
-"VK_KHR_buffer_device_address".
-
-2) Call `vkGetPhysicalDeviceFeatures2` for the physical device instead of the old `vkGetPhysicalDeviceFeatures`.
-Attach the additional structure `VkPhysicalDeviceBufferDeviceAddressFeatures*` to `VkPhysicalDeviceFeatures2::pNext` to be returned.
-Check if the device feature is really supported - check if `VkPhysicalDeviceBufferDeviceAddressFeatures::bufferDeviceAddress` is true.
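-
-A sketch of this feature query (assuming `physicalDevice` is the `VkPhysicalDevice` in question;
-on Vulkan versions before 1.2 the `*KHR` variants of the structure and entry points provided by
-the extension would be used instead):
-
-\code
-VkPhysicalDeviceBufferDeviceAddressFeatures bdaFeatures =
-    { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_BUFFER_DEVICE_ADDRESS_FEATURES };
-
-VkPhysicalDeviceFeatures2 features2 = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2 };
-features2.pNext = &bdaFeatures;
-
-vkGetPhysicalDeviceFeatures2(physicalDevice, &features2);
-
-if(bdaFeatures.bufferDeviceAddress == VK_TRUE)
-{
-    // The feature is supported and can be enabled at device creation.
-}
-\endcode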
- -3) (For Vulkan version < 1.2) While creating device with `vkCreateDevice`, enable this extension - add -"VK_KHR_buffer_device_address" to the list passed as `VkDeviceCreateInfo::ppEnabledExtensionNames`. - -4) While creating the device, also don't set `VkDeviceCreateInfo::pEnabledFeatures`. -Fill in `VkPhysicalDeviceFeatures2` structure instead and pass it as `VkDeviceCreateInfo::pNext`. -Enable this device feature - attach additional structure `VkPhysicalDeviceBufferDeviceAddressFeatures*` to -`VkPhysicalDeviceFeatures2::pNext` and set its member `bufferDeviceAddress` to `VK_TRUE`. - -5) While creating #VmaAllocator with vmaCreateAllocator() inform VMA that you -have enabled this feature - add #VMA_ALLOCATOR_CREATE_BUFFER_DEVICE_ADDRESS_BIT -to VmaAllocatorCreateInfo::flags. - -\section enabling_buffer_device_address_usage Usage - -After following steps described above, you can create buffers with `VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT*` using VMA. -The library automatically adds `VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT*` to -allocated memory blocks wherever it might be needed. - -Please note that the library supports only `VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT*`. -The second part of this functionality related to "capture and replay" is not supported, -as it is intended for usage in debugging tools like RenderDoc, not in everyday Vulkan usage. - -\section enabling_buffer_device_address_more_information More information - -To learn more about this extension, see [VK_KHR_buffer_device_address in Vulkan specification](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/chap46.html#VK_KHR_buffer_device_address) - -Example use of this extension can be found in the code of the sample and test suite -accompanying this library. - -\page general_considerations General considerations - -\section general_considerations_thread_safety Thread safety - -- The library has no global state, so separate #VmaAllocator objects can be used - independently. - There should be no need to create multiple such objects though - one per `VkDevice` is enough. -- By default, all calls to functions that take #VmaAllocator as first parameter - are safe to call from multiple threads simultaneously because they are - synchronized internally when needed. - This includes allocation and deallocation from default memory pool, as well as custom #VmaPool. -- When the allocator is created with #VMA_ALLOCATOR_CREATE_EXTERNALLY_SYNCHRONIZED_BIT - flag, calls to functions that take such #VmaAllocator object must be - synchronized externally. -- Access to a #VmaAllocation object must be externally synchronized. For example, - you must not call vmaGetAllocationInfo() and vmaMapMemory() from different - threads at the same time if you pass the same #VmaAllocation object to these - functions. -- #VmaVirtualBlock is not safe to be used from multiple threads simultaneously. - -\section general_considerations_versioning_and_compatibility Versioning and compatibility - -The library uses [**Semantic Versioning**](https://semver.org/), -which means version numbers follow convention: Major.Minor.Patch (e.g. 2.3.0), where: - -- Incremented Patch version means a release is backward- and forward-compatible, - introducing only some internal improvements, bug fixes, optimizations etc. - or changes that are out of scope of the official API described in this documentation. 
-- Incremented Minor version means a release is backward-compatible,
-  so existing code that uses the library should continue to work, while some new
-  symbols could have been added: new structures, functions, new values in existing
-  enums and bit flags, new structure members, but not new function parameters.
-- Incrementing Major version means a release could break some backward compatibility.
-
-All changes between official releases are documented in the file "CHANGELOG.md".
-
-\warning Backward compatibility is considered on the level of C++ source code, not binary linkage.
-Adding new members to existing structures is treated as backward compatible if initializing
-the new members to binary zero results in the old behavior.
-You should always fully initialize all library structures to zeros and not rely on their
-exact binary size.
-
-\section general_considerations_validation_layer_warnings Validation layer warnings
-
-When using this library, you may encounter the following types of warnings issued by
-the Vulkan validation layer. They don't necessarily indicate a bug, so you may need
-to just ignore them.
-
-- *vkBindBufferMemory(): Binding memory to buffer 0xeb8e4 but vkGetBufferMemoryRequirements() has not been called on that buffer.*
-  - It happens when the VK_KHR_dedicated_allocation extension is enabled.
-    The `vkGetBufferMemoryRequirements2KHR` function is used instead, while the validation layer seems to be unaware of it.
-- *Mapping an image with layout VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL can result in undefined behavior if this memory is used by the device. Only GENERAL or PREINITIALIZED should be used.*
-  - It happens when you map a buffer or image, because the library maps the entire
-    `VkDeviceMemory` block, where different types of images and buffers may end
-    up together, especially on GPUs with unified memory like Intel.
-- *Non-linear image 0xebc91 is aliased with linear buffer 0xeb8e4 which may indicate a bug.*
-  - It may happen when you use [defragmentation](@ref defragmentation).
-
-\section general_considerations_allocation_algorithm Allocation algorithm
-
-The library uses the following algorithm for allocation, in order:
-
--# Try to find a free range of memory in existing blocks.
--# If failed, try to create a new block of `VkDeviceMemory`, with the preferred block size.
--# If failed, try to create such a block with size / 2, size / 4, size / 8.
--# If failed, try to allocate separate `VkDeviceMemory` for this allocation,
-   just like when you use #VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT.
--# If failed, choose another memory type that meets the requirements specified in
-   VmaAllocationCreateInfo and go to point 1.
--# If failed, return `VK_ERROR_OUT_OF_DEVICE_MEMORY`.
-
-\section general_considerations_features_not_supported Features not supported
-
-Features deliberately excluded from the scope of this library:
-
--# **Data transfer.** Uploading (streaming) and downloading data of buffers and images
-   between CPU and GPU memory and related synchronization is the responsibility of the user.
-   Defining some "texture" object that would automatically stream its data from a
-   staging copy in CPU memory to GPU memory would rather be a feature of another,
-   higher-level library implemented on top of VMA.
-   VMA doesn't record any commands to a `VkCommandBuffer`. It just allocates memory.
--# **Recreation of buffers and images.** Although the library has functions for
-   buffer and image creation: vmaCreateBuffer(), vmaCreateImage(), you need to
-   recreate these objects yourself after defragmentation.
That is because the big - structures `VkBufferCreateInfo`, `VkImageCreateInfo` are not stored in - #VmaAllocation object. --# **Handling CPU memory allocation failures.** When dynamically creating small C++ - objects in CPU memory (not Vulkan memory), allocation failures are not checked - and handled gracefully, because that would complicate code significantly and - is usually not needed in desktop PC applications anyway. - Success of an allocation is just checked with an assert. --# **Code free of any compiler warnings.** Maintaining the library to compile and - work correctly on so many different platforms is hard enough. Being free of - any warnings, on any version of any compiler, is simply not feasible. - There are many preprocessor macros that make some variables unused, function parameters unreferenced, - or conditional expressions constant in some configurations. - The code of this library should not be bigger or more complicated just to silence these warnings. - It is recommended to disable such warnings instead. --# This is a C++ library with C interface. **Bindings or ports to any other programming languages** are welcome as external projects but - are not going to be included into this repository. -*/ diff --git a/aten/src/ATen/native/vulkan/glsl/add.glsl b/aten/src/ATen/native/vulkan/glsl/add.glsl index 68864dd45d9c1..95e63f3a25afc 100644 --- a/aten/src/ATen/native/vulkan/glsl/add.glsl +++ b/aten/src/ATen/native/vulkan/glsl/add.glsl @@ -12,7 +12,7 @@ layout(set = 0, binding = 2) uniform PRECISION sample layout(set = 0, binding = 3) uniform PRECISION restrict Block { ivec4 size; ivec4 isize0; - ivec3 isize1; + ivec4 isize1; float alpha; } uBlock; @@ -24,9 +24,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input0_pos = pos % uBlock.isize0.xyz; const ivec3 input1_pos = pos % uBlock.isize1.xyz; - imageStore( - uOutput, - pos, - texelFetch(uInput0, input0_pos, 0) + uBlock.alpha * texelFetch(uInput1, input1_pos, 0)); + const vec4 v0 = uBlock.isize0.w == 1 + ? texelFetch(uInput0, input0_pos, 0).xxxx + : texelFetch(uInput0, input0_pos, 0); + const vec4 v1 = uBlock.isize1.w == 1 + ? texelFetch(uInput1, input1_pos, 0).xxxx + : texelFetch(uInput1, input1_pos, 0); + imageStore(uOutput, pos, v0 + uBlock.alpha * v1); } } diff --git a/aten/src/ATen/native/vulkan/glsl/add_.glsl b/aten/src/ATen/native/vulkan/glsl/add_.glsl index d25d3bdcf85e4..1fe72bb7a878a 100644 --- a/aten/src/ATen/native/vulkan/glsl/add_.glsl +++ b/aten/src/ATen/native/vulkan/glsl/add_.glsl @@ -7,10 +7,10 @@ layout(std430) buffer; /* Qualifiers: layout - storage - precision - memory */ layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput; -layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput0; +layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput; layout(set = 0, binding = 2) uniform PRECISION restrict Block { ivec4 size; - ivec3 isize; + ivec4 isize; float alpha; } uBlock; @@ -21,9 +21,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input_pos = pos % uBlock.isize.xyz; + const vec4 v = uBlock.isize.w == 1 + ? 
texelFetch(uInput, input_pos, 0).xxxx + : texelFetch(uInput, input_pos, 0); imageStore( uOutput, pos, - imageLoad(uOutput, pos) + uBlock.alpha * texelFetch(uInput0, input_pos, 0)); + imageLoad(uOutput, pos) + uBlock.alpha * v); } } diff --git a/aten/src/ATen/native/vulkan/glsl/div.glsl b/aten/src/ATen/native/vulkan/glsl/div.glsl index 43c6e942a3e15..e7ba6de74ca8c 100644 --- a/aten/src/ATen/native/vulkan/glsl/div.glsl +++ b/aten/src/ATen/native/vulkan/glsl/div.glsl @@ -12,7 +12,7 @@ layout(set = 0, binding = 2) uniform PRECISION sample layout(set = 0, binding = 3) uniform PRECISION restrict Block { ivec4 size; ivec4 isize0; - ivec3 isize1; + ivec4 isize1; float alpha; } uBlock; @@ -24,9 +24,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input0_pos = pos % uBlock.isize0.xyz; const ivec3 input1_pos = pos % uBlock.isize1.xyz; - imageStore( - uOutput, - pos, - texelFetch(uInput0, input0_pos, 0) / texelFetch(uInput1, input1_pos, 0)); + const vec4 v0 = uBlock.isize0.w == 1 + ? texelFetch(uInput0, input0_pos, 0).xxxx + : texelFetch(uInput0, input0_pos, 0); + const vec4 v1 = uBlock.isize1.w == 1 + ? texelFetch(uInput1, input1_pos, 0).xxxx + : texelFetch(uInput1, input1_pos, 0); + imageStore(uOutput, pos, v0 / v1); } } diff --git a/aten/src/ATen/native/vulkan/glsl/div_.glsl b/aten/src/ATen/native/vulkan/glsl/div_.glsl index 90bfbad3cfdf2..56065a3839312 100644 --- a/aten/src/ATen/native/vulkan/glsl/div_.glsl +++ b/aten/src/ATen/native/vulkan/glsl/div_.glsl @@ -7,10 +7,10 @@ layout(std430) buffer; /* Qualifiers: layout - storage - precision - memory */ layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput; -layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput0; +layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput; layout(set = 0, binding = 2) uniform PRECISION restrict Block { ivec4 size; - ivec3 isize; + ivec4 isize; float alpha; } uBlock; @@ -21,9 +21,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input_pos = pos % uBlock.isize.xyz; + const vec4 v = uBlock.isize.w == 1 + ? texelFetch(uInput, input_pos, 0).xxxx + : texelFetch(uInput, input_pos, 0); imageStore( uOutput, pos, - imageLoad(uOutput, pos) / texelFetch(uInput0, input_pos, 0)); + imageLoad(uOutput, pos) / v); } } diff --git a/aten/src/ATen/native/vulkan/glsl/mul.glsl b/aten/src/ATen/native/vulkan/glsl/mul.glsl index 43fcd444d11cd..c5aa2c7f522a2 100644 --- a/aten/src/ATen/native/vulkan/glsl/mul.glsl +++ b/aten/src/ATen/native/vulkan/glsl/mul.glsl @@ -12,7 +12,7 @@ layout(set = 0, binding = 2) uniform PRECISION sample layout(set = 0, binding = 3) uniform PRECISION restrict Block { ivec4 size; ivec4 isize0; - ivec3 isize1; + ivec4 isize1; float alpha; } uBlock; @@ -24,9 +24,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input0_pos = pos % uBlock.isize0.xyz; const ivec3 input1_pos = pos % uBlock.isize1.xyz; - imageStore( - uOutput, - pos, - texelFetch(uInput0, input0_pos, 0) * texelFetch(uInput1, input1_pos, 0)); + const vec4 v0 = uBlock.isize0.w == 1 + ? texelFetch(uInput0, input0_pos, 0).xxxx + : texelFetch(uInput0, input0_pos, 0); + const vec4 v1 = uBlock.isize1.w == 1 + ? 
texelFetch(uInput1, input1_pos, 0).xxxx + : texelFetch(uInput1, input1_pos, 0); + imageStore(uOutput, pos, v0 * v1); } } diff --git a/aten/src/ATen/native/vulkan/glsl/mul_.glsl b/aten/src/ATen/native/vulkan/glsl/mul_.glsl index af23e678d9871..6487c6c52760d 100644 --- a/aten/src/ATen/native/vulkan/glsl/mul_.glsl +++ b/aten/src/ATen/native/vulkan/glsl/mul_.glsl @@ -7,10 +7,10 @@ layout(std430) buffer; /* Qualifiers: layout - storage - precision - memory */ layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput; -layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput0; +layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput; layout(set = 0, binding = 2) uniform PRECISION restrict Block { ivec4 size; - ivec3 isize; + ivec4 isize; float alpha; } uBlock; @@ -21,9 +21,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input_pos = pos % uBlock.isize.xyz; + const vec4 v = uBlock.isize.w == 1 + ? texelFetch(uInput, input_pos, 0).xxxx + : texelFetch(uInput, input_pos, 0); imageStore( uOutput, pos, - imageLoad(uOutput, pos) * texelFetch(uInput0, input_pos, 0)); + imageLoad(uOutput, pos) * v); } } diff --git a/aten/src/ATen/native/vulkan/glsl/sub.glsl b/aten/src/ATen/native/vulkan/glsl/sub.glsl index 28ce580abfcd5..9dc89551ea957 100644 --- a/aten/src/ATen/native/vulkan/glsl/sub.glsl +++ b/aten/src/ATen/native/vulkan/glsl/sub.glsl @@ -12,7 +12,7 @@ layout(set = 0, binding = 2) uniform PRECISION sample layout(set = 0, binding = 3) uniform PRECISION restrict Block { ivec4 size; ivec4 isize0; - ivec3 isize1; + ivec4 isize1; float alpha; } uBlock; @@ -24,9 +24,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input0_pos = pos % uBlock.isize0.xyz; const ivec3 input1_pos = pos % uBlock.isize1.xyz; - imageStore( - uOutput, - pos, - texelFetch(uInput0, input0_pos, 0) - uBlock.alpha * texelFetch(uInput1, input1_pos, 0)); + const vec4 v0 = uBlock.isize0.w == 1 + ? texelFetch(uInput0, input0_pos, 0).xxxx + : texelFetch(uInput0, input0_pos, 0); + const vec4 v1 = uBlock.isize1.w == 1 + ? texelFetch(uInput1, input1_pos, 0).xxxx + : texelFetch(uInput1, input1_pos, 0); + imageStore(uOutput, pos, v0 - uBlock.alpha * v1); } } diff --git a/aten/src/ATen/native/vulkan/glsl/sub_.glsl b/aten/src/ATen/native/vulkan/glsl/sub_.glsl index 6baaaf0a4238c..a68e6f9dc2286 100644 --- a/aten/src/ATen/native/vulkan/glsl/sub_.glsl +++ b/aten/src/ATen/native/vulkan/glsl/sub_.glsl @@ -7,10 +7,10 @@ layout(std430) buffer; /* Qualifiers: layout - storage - precision - memory */ layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput; -layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput0; +layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput; layout(set = 0, binding = 2) uniform PRECISION restrict Block { ivec4 size; - ivec3 isize; + ivec4 isize; float alpha; } uBlock; @@ -21,9 +21,12 @@ void main() { if (all(lessThan(pos, uBlock.size.xyz))) { const ivec3 input_pos = pos % uBlock.isize.xyz; + const vec4 v = uBlock.isize.w == 1 + ? 
texelFetch(uInput, input_pos, 0).xxxx + : texelFetch(uInput, input_pos, 0); imageStore( uOutput, pos, - imageLoad(uOutput, pos) - uBlock.alpha * texelFetch(uInput0, input_pos, 0)); + imageLoad(uOutput, pos) - uBlock.alpha * v); } } diff --git a/aten/src/ATen/native/vulkan/ops/Arithmetic.cpp b/aten/src/ATen/native/vulkan/ops/Arithmetic.cpp index 1551bd49b7766..65fc758b065d7 100644 --- a/aten/src/ATen/native/vulkan/ops/Arithmetic.cpp +++ b/aten/src/ATen/native/vulkan/ops/Arithmetic.cpp @@ -10,63 +10,84 @@ namespace vulkan { namespace ops { namespace { -bool broadcast_input(const Tensor& input1, const Tensor& input2) { - return ((height_size(input1) > 1 && height_size(input2) == 1) || - (height_size(input2) > 1 && height_size(input1) == 1) || - (height_size(input1) == height_size(input2))) && - ((width_size(input1) > 1 && width_size(input2) == 1) || - (width_size(input2) > 1 && width_size(input1) == 1) || - (width_size(input1) == width_size(input2))); -} - void check_inputs(const Tensor& input1, const Tensor& input2) { - TORCH_CHECK( - channels_size(input1) == channels_size(input2), - "Vulkan binary elementwise ops require channel dimension to be equal!"); - if (batch_size(input1) != batch_size(input2)) { + const std::string broadcast_error_msg = + "Incompatible dimensions for broadcasting for binary elementwise op!"; + if (get_dim(input1) != get_dim(input2)) { TORCH_CHECK( - channels_size(input1) % 4 == 0, - "Vulkan binary elementwise ops require channel to be a multiple of 4 to broadcast along batch dimension!") + get_dim(input1) == 1 || get_dim(input2), + broadcast_error_msg); + TORCH_CHECK( + (get_dim(input1) == get_dim(input2) && + get_dim(input1) % 4 == 0) || + get_dim(input1) * get_dim(input1) == + 1 || + get_dim(input2) * get_dim(input2) == + 1, + "Invalid broadcasting for Vulkan binary elementwise op! " + "If batch dimensions aren't equal, then channel dimensions must be " + "equal and multiple of 4 or one of the inputs must have " + "channel and batch dimensions both equal to 1!"); + } + if (get_dim(input1) != get_dim(input2)) { + TORCH_CHECK( + get_dim(input1) == 1 || get_dim(input2), + broadcast_error_msg); + TORCH_CHECK( + get_dim(input1) * get_dim(input1) == 1 || + get_dim(input2) * get_dim(input2) == + 1, + "Invalid broadcasting for Vulkan binary elementwise op! 
" + "If channel dimensions aren't equal, then one of the inputs must have " + "channel and batch dimensions both equal to 1!"); + } + if (get_dim(input1) != get_dim(input2)) { + TORCH_CHECK( + get_dim(input1) == 1 || get_dim(input2), + broadcast_error_msg); + } + if (get_dim(input1) != get_dim(input2)) { + TORCH_CHECK( + get_dim(input1) == 1 || get_dim(input2), + broadcast_error_msg); } - - const std::string broadcast_error_msg = - "Incompatible input dimensions for broadcasting for Vulkan binary elementwise op!"; - - TORCH_CHECK(broadcast_input(input1, input2), broadcast_error_msg); } -std::vector broadcast_size( - const Tensor& input1, - const Tensor& input2) { - std::vector out = {}; - int input1_size = input1.sizes().size(); - int input2_size = input2.sizes().size(); - if (input1_size > input2_size) { - for (int i = 0; i < input1_size; i++) { - out.push_back(input1.sizes()[i]); +std::vector broadcast_size(const Tensor& t1, const Tensor& t2) { + int64_t t1_size = t1.dim(); + int64_t t2_size = t2.dim(); + + std::vector out; + if (t1_size > t2_size) { + for (int64_t i = 0; i < t1_size; i++) { + out.push_back(t1.sizes()[i]); } } else { - for (int i = 0; i < input2_size; i++) { - out.push_back(input2.sizes()[i]); + for (int64_t i = 0; i < t2_size; i++) { + out.push_back(t2.sizes()[i]); } } - if (width_size(input1) > 1 && width_size(input2) == 1) { - out[out.size() - 1] = width_size(input1); - } else if (width_size(input2) > 1 && width_size(input1) == 1) { - out[out.size() - 1] = width_size(input2); + if (out.size() > 0) { + out[out.size() - 1] = + std::max(get_dim(t1), get_dim(t2)); } - if (out.size() > 1) { - if (height_size(input1) > 1 && height_size(input2) == 1) { - out[out.size() - 2] = height_size(input1); - } else if (height_size(input2) > 1 && height_size(input1) == 1) { - out[out.size() - 2] = height_size(input2); - } + out[out.size() - 2] = + std::max(get_dim(t1), get_dim(t2)); + } + if (out.size() > 2) { + out[out.size() - 3] = + std::max(get_dim(t1), get_dim(t2)); + } + if (out.size() > 3) { + out[out.size() - 4] = + std::max(get_dim(t1), get_dim(t2)); } return out; } + } // namespace using namespace api::utils; @@ -195,15 +216,17 @@ Tensor arithmetic_tensor( uvec3 extents; uint32_t fill_0; uvec3 input1_extents; - uint32_t fill_1; + uint32_t channel_batch_size_1; uvec3 input2_extents; + uint32_t channel_batch_size_2; float alpha; } block{ v_output.extents(), 0u, v_self.extents(), - 0u, + get_dim(self) * get_dim(self), v_other.extents(), + get_dim(other) * get_dim(other), alpha, }; @@ -326,6 +349,16 @@ Tensor& arithmetic_tensor_( const Tensor& other_arg, const c10::optional& alpha_arg, const api::ShaderSource& shader_descriptor) { + TORCH_CHECK( + get_dim(self_arg) >= get_dim(other_arg) && + get_dim(self_arg) >= + get_dim(other_arg) && + get_dim(self_arg) >= + get_dim(other_arg) && + get_dim(self_arg) >= get_dim(other_arg), + "Dimensions of input tensor to Vulkan in-place binary elementwise op " + "must be less than or equal the dimensions of the underlying tensor."); + check_inputs(self_arg, other_arg); TORCH_CHECK( @@ -344,11 +377,13 @@ Tensor& arithmetic_tensor_( uvec3 extents; uint32_t fill_0; uvec3 input_extents; + uint32_t channel_batch_size_other; float alpha; } block{ v_self.extents(), 0u, v_other.extents(), + get_dim(other) * get_dim(other), alpha, }; @@ -431,13 +466,6 @@ Tensor add_tensor( const Tensor& self_arg, const Tensor& other_arg, const Scalar& alpha) { - if (other_arg.sizes().size() == 0) { - return arithmetic_scalar( - self_arg, - other_arg.item(), - 
c10::optional(alpha.to()), - VK_KERNEL(add_scalar)); - } return arithmetic_tensor( self_arg, other_arg, c10::optional(alpha), VK_KERNEL(add)); } @@ -473,13 +501,6 @@ Tensor sub_tensor( const Tensor& self_arg, const Tensor& other_arg, const Scalar& alpha) { - if (other_arg.sizes().size() == 0) { - return arithmetic_scalar( - self_arg, - other_arg.item(), - c10::optional(-1 * alpha.to()), - VK_KERNEL(add_scalar)); - } return arithmetic_tensor( self_arg, other_arg, c10::optional(alpha), VK_KERNEL(sub)); } @@ -503,13 +524,6 @@ Tensor& mul_scalar_(Tensor& self, const Scalar& other) { } Tensor mul_tensor(const Tensor& self_arg, const Tensor& other_arg) { - if (other_arg.sizes().size() == 0) { - return arithmetic_scalar( - self_arg, - other_arg.item(), - c10::optional(), - VK_KERNEL(mul_scalar)); - } return arithmetic_tensor( self_arg, other_arg, c10::optional(), VK_KERNEL(mul)); } @@ -536,13 +550,6 @@ Tensor& div_scalar_(Tensor& self, const Scalar& other) { } Tensor div_tensor(const Tensor& self_arg, const Tensor& other_arg) { - if (other_arg.sizes().size() == 0) { - return arithmetic_scalar( - self_arg, - 1.0 / other_arg.item(), - c10::optional(), - VK_KERNEL(mul_scalar)); - } return arithmetic_tensor( self_arg, other_arg, c10::optional(), VK_KERNEL(div)); } diff --git a/aten/src/ATen/native/vulkan/ops/Batchnorm.cpp b/aten/src/ATen/native/vulkan/ops/Batchnorm.cpp index 30407e8cec38a..84828aa60468c 100644 --- a/aten/src/ATen/native/vulkan/ops/Batchnorm.cpp +++ b/aten/src/ATen/native/vulkan/ops/Batchnorm.cpp @@ -31,7 +31,7 @@ Tensor batch_norm( "running_var must be defined in evaluation mode."); TORCH_CHECK(input_arg.dim() == 4, "Vulkan batchnorm expects 4-dim input!"); TORCH_CHECK( - channels_size(input_arg) % 4 == 0, + get_dim(input_arg) % 4 == 0, "Vulkan batchnorm expects channel dim to be multiple of 4!"); const Tensor input = input_arg.is_vulkan() ? 
input_arg : input_arg.vulkan(); diff --git a/aten/src/ATen/native/vulkan/ops/Common.cpp b/aten/src/ATen/native/vulkan/ops/Common.cpp index 9336291840967..5a3daeb074288 100644 --- a/aten/src/ATen/native/vulkan/ops/Common.cpp +++ b/aten/src/ATen/native/vulkan/ops/Common.cpp @@ -5,42 +5,6 @@ namespace native { namespace vulkan { namespace ops { -uint32_t batch_size(const Tensor& tensor) { - const IntArrayRef sizes = tensor.sizes(); - const uint32_t dims = sizes.size(); - if (dims < 4) { - return 1; - } - return sizes[dims - 4]; -} - -uint32_t channels_size(const Tensor& tensor) { - const IntArrayRef sizes = tensor.sizes(); - const uint32_t dims = sizes.size(); - if (dims < 3) { - return 1; - } - return sizes[dims - 3]; -} - -uint32_t height_size(const Tensor& tensor) { - const IntArrayRef sizes = tensor.sizes(); - const uint32_t dims = sizes.size(); - if (dims < 2) { - return 1; - } - return sizes[dims - 2]; -} - -uint32_t width_size(const Tensor& tensor) { - const IntArrayRef sizes = tensor.sizes(); - const uint32_t dims = sizes.size(); - if (dims < 1) { - return 1; - } - return sizes[dims - 1]; -} - api::utils::uvec3 adaptive_work_group_size( const api::utils::uvec3& global_work_group) { api::utils::uvec3 local_group_size = {4, 4, 4}; diff --git a/aten/src/ATen/native/vulkan/ops/Common.h b/aten/src/ATen/native/vulkan/ops/Common.h index 2cb6159038bb1..913a2feb80d3a 100644 --- a/aten/src/ATen/native/vulkan/ops/Common.h +++ b/aten/src/ATen/native/vulkan/ops/Common.h @@ -43,10 +43,54 @@ struct Layout final { }; }; -uint32_t batch_size(const Tensor& tensor); -uint32_t channels_size(const Tensor& tensor); -uint32_t height_size(const Tensor& tensor); -uint32_t width_size(const Tensor& tensor); +/* + * Maps a semantic dimension name to an integer that corresponds to its + * innermost ordering in a 4D tensor in NCHW format. Width is the innermost + * dimension, so it corresponds to 1, height is the next innermost, so it + * corresponds to 2, and so on. + */ +struct Dim4D { + static constexpr uint32_t Width = 1u; + static constexpr uint32_t Height = 2u; + static constexpr uint32_t Channel = 3u; + static constexpr uint32_t Batch = 4u; +}; + +/* + * The functions below safely return the size of the dimension at the N-th + * innermost index. If the dimensionality of the size array is not sufficient + * then 1 will be returned. The structs above are intended to be used with + * these functions. + */ +template +uint32_t get_dim(const IntArrayRef sizes) { + const uint32_t dims = sizes.size(); + return dims < N ? 1 : sizes[dims - N]; +} + +template +uint32_t get_dim(const Tensor& t_in) { + return get_dim(t_in.sizes()); +} + +template +uint32_t get_dim(const vTensor& v_in) { + return get_dim(v_in.sizes()); +} + +inline c10::optional get_optional_tensor( + const c10::impl::GenericList& gen_list, + const uint32_t idx) { + return gen_list.get(idx).isTensor() ? gen_list.get(idx).toTensor() + : c10::optional(); +} + +inline c10::optional get_optional_scalar( + const c10::impl::GenericList& gen_list, + const uint32_t idx) { + return gen_list.get(idx).isScalar() ? 
gen_list.get(idx).toScalar() + : c10::optional(); +} api::utils::uvec3 adaptive_work_group_size( const api::utils::uvec3& global_work_group); diff --git a/aten/src/ATen/native/vulkan/ops/Concat.cpp b/aten/src/ATen/native/vulkan/ops/Concat.cpp index d0c2c0cf6afe6..4ab543f5527f0 100644 --- a/aten/src/ATen/native/vulkan/ops/Concat.cpp +++ b/aten/src/ATen/native/vulkan/ops/Concat.cpp @@ -114,7 +114,7 @@ Tensor cat_feature_mult4ch(const TensorList tensors, vTensor& v_output) { api::PipelineBarrier pipeline_barrier{}; - context->submit_texture_copy( + context->submit_copy( // pipeline barrier pipeline_barrier, // images @@ -152,7 +152,7 @@ Tensor cat_height(const TensorList tensors, vTensor& v_output) { api::PipelineBarrier pipeline_barrier{}; - context->submit_texture_copy( + context->submit_copy( // pipeline barrier pipeline_barrier, // images diff --git a/aten/src/ATen/native/vulkan/ops/Convolution.cpp b/aten/src/ATen/native/vulkan/ops/Convolution.cpp index e375a887e0e2e..1f81c2a7ef19f 100644 --- a/aten/src/ATen/native/vulkan/ops/Convolution.cpp +++ b/aten/src/ATen/native/vulkan/ops/Convolution.cpp @@ -5,7 +5,6 @@ #include #include #include -#include #include namespace at { @@ -41,12 +40,19 @@ Conv2dMethod determine_method( const IntArrayRef stride, const IntArrayRef padding, const IntArrayRef dilation, - const int64_t groups) { - if (is_depthwise(filter, groups)) - return Conv2dDepthwise; - if (is_pointwise(filter)) - return Conv2dPointwise; - return Conv2dSlidingWindow; + const int64_t groups, + const bool transposed, + const bool quantized) { + if (transposed) { + return TConv2dSlidingWindow; + } + if (is_depthwise(filter, groups)) { + return quantized ? QConv2dDepthwise : Conv2dDepthwise; + } + if (is_pointwise(filter)) { + return quantized ? QConv2dPointwise : Conv2dPointwise; + } + return quantized ? 
QConv2dSlidingWindow : Conv2dSlidingWindow; } vTensor pack_weights_dw(api::Context* const context, const Tensor& weight) { @@ -77,7 +83,7 @@ vTensor pack_weights_dw(api::Context* const context, const Tensor& weight) { weight.options(), }; - api::StagingBuffer staging(context, v_weight.buffer_bytes()); + api::StorageBuffer staging(context, at::kFloat, v_weight.numcells()); { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); @@ -111,7 +117,10 @@ vTensor pack_weights_dw(api::Context* const context, const Tensor& weight) { return v_weight; } -vTensor pack_weights_2d(api::Context* const context, const Tensor& weight) { +vTensor pack_weights_2d( + api::Context* const context, + const Tensor& weight, + bool reversed) { /* Source */ const IntArrayRef src_filter = weight.sizes(); const float* const src_weight_ptr = weight.data_ptr(); @@ -142,7 +151,7 @@ vTensor pack_weights_2d(api::Context* const context, const Tensor& weight) { weight.options(), }; - api::StagingBuffer staging(context, v_weight.buffer_bytes()); + api::StorageBuffer staging(context, at::kFloat, v_weight.numcells()); { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); @@ -161,6 +170,163 @@ vTensor pack_weights_2d(api::Context* const context, const Tensor& weight) { float* const dst_weight_c_ptr = dst_weight_ptr + dst_c * dst_kernel_sz; + if (reversed) { + for (const auto src_ic : + c10::irange(src_filter[Layout::Filter::input])) { + for (const auto src_ih : c10::irange(src_kh_sz)) { + const int64_t dst_h = src_kh_sz - 1 - src_ih; + for (const auto src_iw : c10::irange(src_kw_sz)) { + const int64_t dst_w = src_kw_sz - 1 - src_iw; + const int64_t dst_w_offset = dst_w * stack_depth; + memcpy( + dst_weight_c_ptr + (dst_oh * src_kh_sz + dst_h) * dst_kw_sz + + src_ic + dst_w_offset, + src_weight_oc_ptr + src_ic * src_kernel_sz + + src_ih * src_kw_sz + src_iw, + sizeof(float)); + } + } + } + } else { + for (const auto src_ic : + c10::irange(src_filter[Layout::Filter::input])) { + const int64_t dst_ic4 = src_ic / 4; + for (const auto src_ih : c10::irange(src_kh_sz)) { + for (const auto src_iw : c10::irange(src_kw_sz)) { + memcpy( + dst_weight_c_ptr + (dst_oh * src_kh_sz + src_ih) * dst_kw_sz + + dst_ic4 * src_kw_sz * 4 + src_iw * 4 + src_ic % 4, + src_weight_oc_ptr + src_ic * src_kernel_sz + + src_ih * src_kw_sz + src_iw, + sizeof(float)); + } + } + } + } + } + } + utils::pack_staging_to_vtensor(staging.buffer(), v_weight); + + return v_weight; +} + +vTensor pack_weights_dw_q(api::Context* const context, const Tensor& weight) { + /* Source */ + const IntArrayRef src_filter = weight.sizes(); + const c10::quint8* const src_weight_ptr = weight.data_ptr(); + + const int64_t src_kw_sz = src_filter[Layout::Filter::width]; + const int64_t src_kh_sz = src_filter[Layout::Filter::height]; + const int64_t src_kernel_sz = src_kw_sz * src_kh_sz; + const int64_t src_block_sz = + src_kernel_sz * src_filter[Layout::Filter::input]; + const int64_t num_stacks = + div_up(src_filter[Layout::Filter::output], INT64_C(4)); + + /* Destination */ + const int64_t dst_kw_sz = src_kernel_sz; + const int64_t dst_kh_sz = num_stacks; + const int64_t dst_kernel_sz = dst_kw_sz * dst_kh_sz; + + vTensor v_weight{ + context, + { + 4, + dst_kh_sz, + dst_kw_sz, + }, + weight.options(), + weight.q_scale(), + weight.q_zero_point(), + }; + + api::StorageBuffer staging(context, at::kFloat, v_weight.numcells()); + { + api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); + + c10::quint8* dst_weight_ptr = mapping.template 
data(); + + memset(dst_weight_ptr, 0, v_weight.nbytes()); + + for (const auto src_oc : c10::irange(src_filter[Layout::Filter::output])) { + /* Source */ + const c10::quint8* const src_weight_oc_ptr = + src_weight_ptr + src_oc * src_block_sz; + + /* Destination */ + const int64_t dst_oh = src_oc / 4; + const int64_t dst_c = src_oc % 4; + + c10::quint8* const dst_weight_c_ptr = + dst_weight_ptr + dst_c * dst_kernel_sz + dst_oh * dst_kw_sz; + + for (const auto src_ih : + c10::irange(src_filter[Layout::Filter::height])) { + memcpy( + dst_weight_c_ptr + src_ih * src_kw_sz, + src_weight_oc_ptr + src_ih * src_kw_sz, + sizeof(c10::quint8) * src_kw_sz); + } + } + } + ops::utils::pack_staging_to_vtensor(staging.buffer(), v_weight); + + return v_weight; +} + +vTensor pack_weights_2d_q(api::Context* const context, const Tensor& weight) { + /* Source */ + const IntArrayRef src_filter = weight.sizes(); + const c10::quint8* const src_weight_ptr = weight.data_ptr(); + + const int64_t src_kw_sz = src_filter[Layout::Filter::width]; + const int64_t src_kh_sz = src_filter[Layout::Filter::height]; + const int64_t src_kernel_sz = src_kw_sz * src_kh_sz; + const int64_t src_block_sz = + src_kernel_sz * src_filter[Layout::Filter::input]; + + const int64_t num_stacks = + div_up(src_filter[Layout::Filter::output], INT64_C(4)); + const int64_t stack_depth = + api::utils::align_up(src_filter[Layout::Filter::input], INT64_C(4)); + + /* Destination */ + const int64_t dst_kw_sz = src_kw_sz * stack_depth; + const int64_t dst_kh_sz = src_kh_sz * num_stacks; + const int64_t dst_kernel_sz = dst_kw_sz * dst_kh_sz; + + vTensor v_weight{ + context, + { + 4, + dst_kh_sz, + dst_kw_sz, + }, + weight.options(), + weight.q_scale(), + weight.q_zero_point(), + }; + + api::StorageBuffer staging(context, at::kFloat, v_weight.numcells()); + { + api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); + + c10::quint8* dst_weight_ptr = mapping.template data(); + + memset(dst_weight_ptr, 0, v_weight.nbytes()); + + for (const auto src_oc : c10::irange(src_filter[Layout::Filter::output])) { + /* Source */ + const c10::quint8* const src_weight_oc_ptr = + src_weight_ptr + src_oc * src_block_sz; + + /* Destination */ + const int64_t dst_oh = src_oc / 4; + const int64_t dst_c = src_oc % 4; + + c10::quint8* const dst_weight_c_ptr = + dst_weight_ptr + dst_c * dst_kernel_sz; + for (const auto src_ic : c10::irange(src_filter[Layout::Filter::input])) { const int64_t dst_ic4 = src_ic / 4; @@ -171,7 +337,7 @@ vTensor pack_weights_2d(api::Context* const context, const Tensor& weight) { dst_ic4 * src_kw_sz * 4 + src_iw * 4 + src_ic % 4, src_weight_oc_ptr + src_ic * src_kernel_sz + src_ih * src_kw_sz + src_iw, - sizeof(float)); + sizeof(c10::quint8)); } } } @@ -182,30 +348,48 @@ vTensor pack_weights_2d(api::Context* const context, const Tensor& weight) { return v_weight; } -vTensor pack_weights(const Tensor& weight_arg, const Conv2dMethod conv_method) { +vTensor pack_weights( + const Tensor& weight_arg, + const bool transposed, + const bool quantized, + const Conv2dMethod conv_method) { if (weight_arg.is_vulkan()) { return convert(weight_arg); } api::Context* const context = api::context(); - const Tensor weight = weight_arg.contiguous(); + const Tensor weight = transposed + ? 
at::permute(weight_arg, {1, 0, 2, 3}).contiguous() + : weight_arg.contiguous(); + if (transposed) { + return pack_weights_2d(context, weight, true); + } + if (quantized) { + if (conv_method == QConv2dDepthwise) { + return pack_weights_dw_q(context, weight); + } + return pack_weights_2d_q(context, weight); + } if (conv_method == Conv2dDepthwise) { return pack_weights_dw(context, weight); } - - return pack_weights_2d(context, weight); + return pack_weights_2d(context, weight, false); } -vTensor pack_biases(const c10::optional& bias, const Tensor& weight) { +vTensor pack_biases_reg( + const c10::optional& bias, + const Tensor& weight, + const bool transposed) { if (bias && bias->is_vulkan()) { return convert(*bias); } api::Context* const context = api::context(); - const int64_t src_w = weight.size(Layout::Filter::output); + const int64_t src_w = weight.size( + transposed ? Layout::TransposedFilter::output : Layout::Filter::output); const int64_t packed_w = div_up(src_w, INT64_C(4)); vTensor v_bias{ context, @@ -217,7 +401,7 @@ vTensor pack_biases(const c10::optional& bias, const Tensor& weight) { weight.options(), }; - api::StagingBuffer staging(context, v_bias.buffer_bytes()); + api::StorageBuffer staging(context, at::kFloat, v_bias.numcells()); { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); @@ -242,14 +426,78 @@ vTensor pack_biases(const c10::optional& bias, const Tensor& weight) { v_bias.nbytes()); } } + utils::pack_staging_to_vtensor(staging.buffer(), v_bias); + + return v_bias; +} + +vTensor pack_biases_q(const c10::optional& bias, const Tensor& weight) { + if (bias && bias->is_vulkan()) { + return convert(*bias); + } + + api::Context* const context = api::context(); + + const int64_t src_w = weight.size(Layout::Filter::output); + const int64_t packed_w = div_up(src_w, INT64_C(4)); + vTensor v_bias{ + context, + { + 4, + 1, + packed_w, + }, + weight.options(), + weight.q_scale(), + weight.q_zero_point(), + }; + + api::StorageBuffer staging(context, at::kFloat, v_bias.numcells()); + { + api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); + + c10::quint8* dst_bias_ptr = mapping.template data(); + + if (bias) { + const c10::quint8* const src_bias_ptr = + bias->contiguous().data_ptr(); + + memset(dst_bias_ptr, 0, v_bias.nbytes()); + for (const auto i : c10::irange(src_w)) { + const int64_t c = i % 4; + const int64_t x = i / 4; + dst_bias_ptr[c * packed_w + x] = src_bias_ptr[i]; + } + } else { + memset( + dst_bias_ptr, + // 2's complement integers and IEEE-754 floating point numbers both + // have identical bit representations for 0, so can use memset which + // only accepts uint8_t parameter. + 0, + v_bias.nbytes()); + } + } ops::utils::pack_staging_to_vtensor(staging.buffer(), v_bias); return v_bias; } +vTensor pack_biases( + const c10::optional& bias, + const Tensor& weight, + const bool transposed, + const bool quantized) { + if (quantized) { + return pack_biases_q(bias, weight); + } + return pack_biases_reg(bias, weight, transposed); +} + std::array pack_filter( const Tensor& weight, - const IntArrayRef dilation) { + const IntArrayRef dilation, + const bool transposed) { const IntArrayRef filter = weight.sizes(); const auto effective = [](const int64_t k, const int64_t d) { @@ -257,8 +505,14 @@ std::array pack_filter( }; return { - align_up(filter[Layout::Filter::output], INT64_C(4)), - align_up(filter[Layout::Filter::input], INT64_C(4)), + align_up( + transposed ? 
filter[Layout::TransposedFilter::output] + : filter[Layout::Filter::output], + INT64_C(4)), + align_up( + transposed ? filter[Layout::TransposedFilter::input] + : filter[Layout::Filter::input], + INT64_C(4)), effective( filter[Layout::Filter::height], dilation[Layout::Parameter::height]), effective( @@ -275,6 +529,35 @@ std::array pack_params(const std::vector& vector) { }; } +bool weight_valid(const Tensor& weight, const bool quantized) { + return (4 == weight.ndimension()) && + (weight.size(Layout::Filter::height) > 0) && + (weight.size(Layout::Filter::width) > 0) && + ((weight.device().is_cpu()) || + (c10::DeviceType::Vulkan == weight.device().type())) && + (kFloat == weight.scalar_type() || + (quantized && c10::kQUInt8 == weight.scalar_type())); +} + +bool bias_valid( + const c10::optional& bias, + const Tensor& weight, + const bool transposed, + const bool quantized) { + if (bias && bias->defined()) { + return (1 == bias->ndimension()) && + ((bias->device().is_cpu()) || + (c10::DeviceType::Vulkan == bias->device().type())) && + (kFloat == bias->scalar_type() || + (quantized && c10::kQUInt8 == bias->scalar_type())) && + (transposed ? (weight.size(Layout::TransposedFilter::output) == + bias->size(Layout::Filter::output)) + : (weight.size(Layout::Filter::output) == + bias->size(Layout::Filter::output))); + } + return true; +} + bool available( const Tensor& weight, const c10::optional& bias, @@ -282,27 +565,16 @@ bool available( const IntArrayRef padding, const IntArrayRef dilation, const bool transposed, + const bool quantized, const IntArrayRef /* output_padding */, const int64_t groups, const c10::optional& output_min, const c10::optional& output_max) { return api::available() && // Weight - (4 == weight.ndimension()) && (weight.size(Layout::Filter::height) > 0) && - (weight.size(Layout::Filter::width) > 0) && - ((weight.device().is_cpu()) || - (c10::DeviceType::Vulkan == weight.device().type())) && - (kFloat == weight.scalar_type()) && + weight_valid(weight, quantized) && // Bias - ((bias && bias->defined()) - ? ((1 == bias->ndimension()) && - ((bias->device().is_cpu()) || - (c10::DeviceType::Vulkan == bias->device().type())) && - (kFloat == bias->scalar_type()) && - (transposed ? false /* to be addded in the future */ - : (weight.size(Layout::Filter::output) == - bias->size(Layout::Filter::output)))) - : true) && + bias_valid(bias, weight, transposed, quantized) && // Stride (stride[Layout::Parameter::height] > 0) && (stride[Layout::Parameter::width] > 0) && @@ -310,8 +582,10 @@ bool available( (padding[Layout::Parameter::height] >= 0) && (padding[Layout::Parameter::width] >= 0) && // Dilation - (dilation[Layout::Parameter::height] > 0) && - (dilation[Layout::Parameter::width] > 0) && + (transposed ? 
(dilation[Layout::Parameter::height] == 1) && + (dilation[Layout::Parameter::width] == 1) + : (dilation[Layout::Parameter::height] > 0) && + (dilation[Layout::Parameter::width] > 0)) && // Groups (groups > 0) && // Input @@ -325,11 +599,12 @@ bool available( (!output_max || output_max->isFloatingPoint()) && true; } -bool usable(const Tensor& input) { +bool usable(const Tensor& input, const bool quantized) { // Input return (4 == input.ndimension()) && (c10::DeviceType::Vulkan == input.device().type()) && - (kFloat == input.scalar_type()) && + (kFloat == input.scalar_type() || + (quantized && c10::kQUInt8 == input.scalar_type())) && (input.size(Layout::Activation4D::batch) >= 0) && (input.size(Layout::Activation4D::channels) > 0) && (input.size(Layout::Activation4D::height) > 0) && @@ -337,77 +612,22 @@ bool usable(const Tensor& input) { true; } -} // namespace - -VulkanOpContext conv2d_context_create( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride_arg, - const IntArrayRef padding_arg, - const IntArrayRef dilation_arg, - const bool transposed, - const IntArrayRef output_padding_arg, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - const auto stride = expand_param_if_needed(stride_arg, "stride", 2); - const auto padding = expand_param_if_needed(padding_arg, "padding", 2); - const auto dilation = expand_param_if_needed(dilation_arg, "dilation", 2); - const auto output_padding = output_padding_arg; // TODO: Deconvolutions - - TORCH_CHECK( - available( - weight, - bias, - stride, - padding, - dilation, - transposed, - output_padding, - groups, - output_min, - output_max), - "Vulkan::convolution not available! " - "Reason: The provided (weight, bias, stride, padding, dilation, groups, " - "transposed, output_padding, output_min, output_max) parameters are either " - "invalid individually or their combination is not supported by Vulkan impl."); - - const auto method = - determine_method(weight.sizes(), stride, padding, dilation, groups); - - c10::impl::GenericList packed_context{c10::AnyType::get()}; - packed_context.reserve(10); - packed_context.emplace_back(convert(pack_weights(weight, method))); - packed_context.emplace_back(convert(pack_biases(bias, weight))); - packed_context.emplace_back(pack_filter(weight, dilation)); - packed_context.emplace_back(pack_params(stride)); - packed_context.emplace_back(pack_params(padding)); - packed_context.emplace_back(output_padding); - packed_context.emplace_back(pack_params(dilation)); - packed_context.emplace_back(safe_downcast(groups)); - packed_context.emplace_back( - output_min ? output_min->template to() - : -std::numeric_limits::infinity()); - packed_context.emplace_back( - output_max ? 
output_max->template to() - : +std::numeric_limits::infinity()); - packed_context.emplace_back(method); - - c10::impl::GenericList unpacked_context{c10::AnyType::get()}; - unpacked_context.reserve(10); - unpacked_context.emplace_back(weight); - unpacked_context.emplace_back(bias); - unpacked_context.emplace_back(weight.sizes().vec()); - unpacked_context.emplace_back(stride_arg.vec()); - unpacked_context.emplace_back(padding_arg.vec()); - unpacked_context.emplace_back(output_padding_arg.vec()); - unpacked_context.emplace_back(dilation_arg.vec()); - unpacked_context.emplace_back(groups); - unpacked_context.emplace_back(output_min); - unpacked_context.emplace_back(output_max); - unpacked_context.emplace_back(method); - - return VulkanOpContext::create(packed_context, unpacked_context); +static inline std::vector get_conv_transpose_output_size( + IntArrayRef input_size, + IntArrayRef weight_size, + IntArrayRef padding, + IntArrayRef output_padding, + IntArrayRef stride, + IntArrayRef dilation = IntArrayRef()) { + auto dim = input_size.size(); + std::vector output_size(dim); + output_size[0] = input_size[input_batch_size_dim]; + output_size[1] = weight_size[weight_input_channels_dim]; + for (const auto d : c10::irange(2, dim)) { + output_size[d] = stride[d - 2] * (input_size[d] - 1) + weight_size[d] - + 2 * padding[d - 2] + output_padding[d - 2]; + } + return output_size; } void conv2d_sliding_window( @@ -438,7 +658,8 @@ void conv2d_sliding_window( ivec4 src_filter; } block{ v_output.extents(), - safe_downcast(packed_filter[Layout::Filter::input]), + safe_downcast( + packed_filter[Layout::Filter::input]), /* this is aligned up */ { safe_downcast(packed_filter[Layout::Filter::width]), safe_downcast(packed_filter[Layout::Filter::height]), @@ -503,44 +724,441 @@ void conv2d_sliding_window( params.buffer()); } -Tensor conv2d_context_run( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context) { +void conv2d_sliding_window_q( + const api::ShaderSource& shader, + vTensor& v_output, + const vTensor& v_input, + const vTensor& packed_v_weight, + const vTensor& packed_v_bias, + const IntArrayRef packed_filter, + const IntArrayRef packed_stride, + const IntArrayRef packed_padding, + const IntArrayRef packed_dilation, + const float packed_output_min, + const float packed_output_max, + const IntArrayRef unpacked_filter, + const Conv2dMethod method_, + const double scale, + const int64_t zero_point) { api::Context* const context = api::context(); - const Tensor input = input_arg.is_vulkan() ? 
input_arg : input_arg.vulkan(); - const vTensor& v_input = convert(input); + const double scale_out = v_output.get_scale(); + const int64_t zero_point_out = v_output.get_zero_point(); - const vTensor& packed_v_weight = convert(packed_context.get(0).toTensor()); - const vTensor& packed_v_bias = convert(packed_context.get(1).toTensor()); + const double weight_scale = packed_v_weight.get_scale(); + const int64_t weight_zero_point = packed_v_weight.get_zero_point(); - const auto packed_filter = packed_context.get(2).toIntVector(); - const auto packed_stride = packed_context.get(3).toIntVector(); - const auto packed_padding = packed_context.get(4).toIntVector(); - const auto packed_dilation = packed_context.get(6).toIntVector(); - const float packed_output_min = packed_context.get(8).toDouble(); - const float packed_output_max = packed_context.get(9).toDouble(); - const auto unpacked_filter = unpacked_context.get(2).toIntVector(); - const Conv2dMethod method_ = (Conv2dMethod)unpacked_context.get(10).toInt(); - - TORCH_CHECK( - usable(input), - "Vulkan Convolution not usable! " - "Reason: The provided input tensor is either invalid or unsupported by Vulkan impl."); + const double bias_scale = packed_v_bias.get_scale(); + const int64_t bias_zero_point = packed_v_bias.get_zero_point(); - vTensor v_output{ - context, - conv_output_size( - v_input.sizes(), - unpacked_filter, - packed_padding, - packed_stride, - packed_dilation), + const struct Block final { + uvec3 extents; + int32_t ic4; + ivec4 kernel; + float scale_out; + float scale; + int32_t zero_point_out; + int32_t zero_point; + float weight_scale; + float bias_scale; + int32_t weight_zero_point; + int32_t bias_zero_point; + ivec2 ikernel; + ivec2 stride; + ivec2 padding; + ivec2 dilate; + vec2 clamp; + } block{ + v_output.extents(), + safe_downcast(packed_filter[Layout::Filter::input]), + { + safe_downcast(packed_filter[Layout::Filter::width]), + safe_downcast(packed_filter[Layout::Filter::height]), + safe_downcast(v_input.sizes()[Layout::Activation4D::width]), + safe_downcast(v_input.sizes()[Layout::Activation4D::height]), + }, + safe_downcast(scale_out), + safe_downcast(scale), + safe_downcast(zero_point_out), + safe_downcast(zero_point), + safe_downcast(weight_scale), + safe_downcast(bias_scale), + safe_downcast(weight_zero_point), + safe_downcast(bias_zero_point), + { + safe_downcast(unpacked_filter[Layout::Filter::width]), + safe_downcast(unpacked_filter[Layout::Filter::height]), + }, + { + safe_downcast(packed_stride[Layout::Parameter::width]), + safe_downcast(packed_stride[Layout::Parameter::height]), + }, + { + safe_downcast(packed_padding[Layout::Parameter::width]), + safe_downcast(packed_padding[Layout::Parameter::height]), + }, + { + safe_downcast(packed_dilation[Layout::Parameter::width]), + safe_downcast(packed_dilation[Layout::Parameter::height]), + }, + { + packed_output_min, + packed_output_max, + }, + }; + + uvec3 global_size = v_output.extents(); + if (method_ == QConv2dPointwise) { + global_size = { + safe_downcast( + div_up(v_output.sizes()[Layout::Filter::width], INT64_C(2))), + safe_downcast( + div_up(v_output.sizes()[Layout::Filter::height], INT64_C(2))), + v_output.extents().data[2u]}; + } + + api::UniformParamsBuffer params(context, block); + api::PipelineBarrier pipeline_barrier{}; + + context->submit_compute_job( + // shader descriptor + shader, + // pipeline barrier + pipeline_barrier, + // global work group size + global_size, + // local work group size + adaptive_work_group_size(global_size), + // fence handle 
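+      // Illustrative note (added for exposition, not part of this patch): the
+      // pointwise branch above halves the global size in x and y, so a
+      // hypothetical output of width 8, height 6 with a depth extent of 4 is
+      // dispatched as {div_up(8, 2), div_up(6, 2), 4} = {4, 3, 4} invocations,
+      // each presumably covering a 2x2 tile of output texels.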
+ VK_NULL_HANDLE, + // shader arguments + v_output.image( + pipeline_barrier, + api::PipelineStage::COMPUTE, + api::MemoryAccessType::WRITE), + v_input.image(pipeline_barrier, api::PipelineStage::COMPUTE), + packed_v_weight.image(pipeline_barrier, api::PipelineStage::COMPUTE), + packed_v_bias.image(pipeline_barrier, api::PipelineStage::COMPUTE), + // params buffer + params.buffer()); +} + +Tensor convolution( + const Tensor& input, + const Tensor& weight, + const c10::optional& bias, + const IntArrayRef stride, + const IntArrayRef padding, + const IntArrayRef dilation, + const bool transposed, + const IntArrayRef output_padding, + const int64_t groups) { + Conv2dPackedContext conv_context = Conv2dPackedContext( + weight, + bias, + stride, + padding, + dilation, + transposed, + false, + output_padding, + groups); + + return run_conv2d_context( + input, c10::make_intrusive(conv_context)); +} + +Tensor quantized_convolution( + const Tensor& input, + const Tensor& weight, + const c10::optional& bias, + const IntArrayRef stride, + const IntArrayRef padding, + const IntArrayRef dilation, + const bool transposed, + const IntArrayRef output_padding, + const int64_t groups, + const double out_scale, + const int64_t out_zero_point) { + if (transposed) { + return run_tconv2d_context( + input, + c10::make_intrusive(Conv2dPackedContext( + weight, + bias, + stride, + padding, + dilation, + transposed, + false, + output_padding, + groups))); + } + + Conv2dPackedContext conv_context = Conv2dPackedContext( + weight, + bias, + stride, + padding, + dilation, + transposed, + true, + output_padding, + groups); + + return run_qconv2d_context( + input, + out_scale, + out_zero_point, + c10::make_intrusive(conv_context)); +} + +} // namespace + +Conv2dPackedContext::Conv2dPackedContext( + const Tensor& weight, + const c10::optional& bias, + const IntArrayRef stride_arg, + const IntArrayRef padding_arg, + const IntArrayRef dilation_arg, + const bool transposed, + const bool quantized, + const IntArrayRef output_padding_arg, + const int64_t groups, + const c10::optional& output_min, + const c10::optional& output_max) + : unpacked_{c10::AnyType::get()} { + const auto stride = expand_param_if_needed(stride_arg, "stride", 2); + const auto padding = expand_param_if_needed(padding_arg, "padding", 2); + const auto dilation = expand_param_if_needed(dilation_arg, "dilation", 2); + const auto output_padding = + expand_param_if_needed(output_padding_arg, "output_padding", 2); + + TORCH_CHECK( + available( + weight, + bias, + stride, + padding, + dilation, + transposed, + quantized, + output_padding, + groups, + output_min, + output_max), + "Vulkan::convolution not available! 
" + "Reason: The provided (weight, bias, stride, padding, dilation, groups, " + "transposed, output_padding, output_min, output_max) parameters are either " + "invalid individually or their combination is not supported by Vulkan impl."); + + const auto method = determine_method( + weight.sizes(), stride, padding, dilation, groups, transposed, quantized); + + packed_.reserve(Packed::NumArgs); + packed_.emplace_back( + convert(pack_weights(weight, transposed, quantized, method))); + packed_.emplace_back( + convert(pack_biases(bias, weight, transposed, quantized))); + packed_.emplace_back(pack_filter(weight, dilation, transposed)); + packed_.emplace_back(pack_params(stride)); + packed_.emplace_back(pack_params(padding)); + packed_.emplace_back(output_padding); + packed_.emplace_back(pack_params(dilation)); + packed_.emplace_back(transposed); + packed_.emplace_back(quantized); + packed_.emplace_back(safe_downcast(groups)); + packed_.emplace_back( + output_min ? output_min->template to() + : -std::numeric_limits::infinity()); + packed_.emplace_back( + output_max ? output_max->template to() + : +std::numeric_limits::infinity()); + packed_.emplace_back(method); + packed_.emplace_back(weight.sizes().vec()); + + if (!at::globalContext().releaseWeightsWhenPrepacking()) { + unpacked_.reserve(Unpacked::NumArgs); + unpacked_.emplace_back(weight); + unpacked_.emplace_back(bias); + unpacked_.emplace_back(stride_arg.vec()); + unpacked_.emplace_back(padding_arg.vec()); + unpacked_.emplace_back(dilation_arg.vec()); + unpacked_.emplace_back(transposed); + unpacked_.emplace_back(quantized); + unpacked_.emplace_back(output_padding_arg.vec()); + unpacked_.emplace_back(groups); + unpacked_.emplace_back(output_min); + unpacked_.emplace_back(output_max); + } +} + +Conv2dPackedContext Conv2dPackedContext::pack(c10::impl::GenericList unpacked) { + return Conv2dPackedContext( + unpacked.get(Unpacked::Weight).toTensor(), + get_optional_tensor(unpacked, Unpacked::Bias), + unpacked.get(Unpacked::Stride).toIntVector(), + unpacked.get(Unpacked::Padding).toIntVector(), + unpacked.get(Unpacked::Dilation).toIntVector(), + unpacked.get(Unpacked::isTransposed).toBool(), + unpacked.get(Unpacked::isQuantized).toBool(), + unpacked.get(Unpacked::OutputPadding).toIntVector(), + unpacked.get(Unpacked::Groups).toInt(), + get_optional_scalar(unpacked, Unpacked::OutputMin), + get_optional_scalar(unpacked, Unpacked::OutputMax)); +} + +c10::intrusive_ptr create_conv2d_context( + Tensor&& weight, + c10::optional&& bias, + std::vector&& stride, + std::vector&& padding, + std::vector&& dilation, + const int64_t groups, + const c10::optional& output_min, + const c10::optional& output_max) { + return c10::make_intrusive(Conv2dPackedContext( + weight, + bias, + stride, + padding, + dilation, + /* transposed = */ false, + /* quantized = */ false, + /* output_padding_arg = */ {0}, + groups, + output_min, + output_max)); +} + +c10::intrusive_ptr create_tconv2d_context( + Tensor&& weight, + c10::optional&& bias, + std::vector&& stride, + std::vector&& padding, + std::vector&& output_padding, + std::vector&& dilation, + const int64_t groups, + const c10::optional& output_min, + const c10::optional& output_max) { + return c10::make_intrusive(Conv2dPackedContext( + weight, + bias, + stride, + padding, + dilation, + /* transposed = */ true, + /* quantized = */ false, + output_padding, + groups, + output_min, + output_max)); +} + +c10::intrusive_ptr create_qconv2d_context( + Tensor&& weight, + c10::optional&& bias, + std::vector&& stride, + 
std::vector&& padding, + std::vector&& dilation, + const int64_t groups, + const c10::optional& output_min, + const c10::optional& output_max) { + return c10::make_intrusive(Conv2dPackedContext( + weight, + bias, + stride, + padding, + dilation, + /* transposed = */ false, + /* quantized = */ true, + /* output_padding_arg = */ {}, + groups, + output_min, + output_max)); +} + +Tensor run_conv2d_context( + const Tensor& input_arg, + const c10::intrusive_ptr& conv_context) { + api::Context* const context = api::context(); + + const Tensor input = input_arg.is_vulkan() ? input_arg : input_arg.vulkan(); + const vTensor& v_input = convert(input); + + const vTensor& packed_v_weight = convert( + conv_context->get_val(Conv2dPackedContext::Packed::Weight).toTensor()); + const vTensor& packed_v_bias = convert( + conv_context->get_val(Conv2dPackedContext::Packed::Bias).toTensor()); + const auto packed_filter = + conv_context->get_val(Conv2dPackedContext::Packed::FilterSizes) + .toIntVector(); + const auto packed_stride = + conv_context->get_val(Conv2dPackedContext::Packed::Stride).toIntVector(); + const auto packed_padding = + conv_context->get_val(Conv2dPackedContext::Packed::Padding).toIntVector(); + const auto packed_output_padding = + conv_context->get_val(Conv2dPackedContext::Packed::OutputPadding) + .toIntVector(); + const auto packed_dilation = + conv_context->get_val(Conv2dPackedContext::Packed::Dilation) + .toIntVector(); + const auto transposed = + conv_context->get_val(Conv2dPackedContext::Packed::isTransposed).toBool(); + const auto quantized = + conv_context->get_val(Conv2dPackedContext::Packed::isQuantized).toBool(); + const float packed_output_min = safe_downcast( + conv_context->get_val(Conv2dPackedContext::Packed::OutputMin).toDouble()); + const float packed_output_max = safe_downcast( + conv_context->get_val(Conv2dPackedContext::Packed::OutputMax).toDouble()); + const Conv2dMethod method_ = + (Conv2dMethod)conv_context + ->get_val(Conv2dPackedContext::Packed::ConvMethod) + .toInt(); + const auto unpacked_filter = + conv_context->get_val(Conv2dPackedContext::Packed::WeightSizes) + .toIntVector(); + + TORCH_CHECK( + usable(input, quantized), + "Vulkan Convolution not usable! " + "Reason: The provided input tensor is either invalid or unsupported by Vulkan impl."); + + vTensor v_output{ + context, + transposed ? 
get_conv_transpose_output_size( + v_input.sizes(), + unpacked_filter, + packed_padding, + packed_output_padding, + packed_stride, + packed_dilation) + : conv_output_size( + v_input.sizes(), + unpacked_filter, + packed_padding, + packed_stride, + packed_dilation), input.options(), }; switch (method_) { + case TConv2dSlidingWindow: + conv2d_sliding_window( + VK_KERNEL(conv_transpose2d), + v_output, + v_input, + packed_v_weight, + packed_v_bias, + packed_filter, + packed_stride, + packed_padding, + packed_dilation, + packed_output_min, + packed_output_max, + unpacked_filter, + method_); + break; case Conv2dDepthwise: conv2d_sliding_window( VK_KERNEL(conv2d_dw), @@ -594,38 +1212,161 @@ Tensor conv2d_context_run( return convert(v_output); } -c10::intrusive_ptr create_conv2d_clamp_context( - Tensor&& weight, - c10::optional&& bias, - std::vector&& stride, - std::vector&& padding, - std::vector&& dilation, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - return c10::make_intrusive(conv2d_context_create( +Tensor run_tconv2d_context( + const Tensor& input_arg, + const c10::intrusive_ptr& conv_context) { + return run_conv2d_context(input_arg, conv_context); +} + +// TODO: this can probably be consolidated with the other run method +Tensor run_qconv2d_context( + const Tensor& input_arg, + double scale, + int64_t zero_point, + const c10::intrusive_ptr& conv_context) { + api::Context* const context = api::context(); + + const Tensor input = input_arg.is_vulkan() ? input_arg : input_arg.vulkan(); + const vTensor& v_input = convert(input); + + const vTensor& packed_v_weight = convert( + conv_context->get_val(Conv2dPackedContext::Packed::Weight).toTensor()); + const vTensor& packed_v_bias = convert( + conv_context->get_val(Conv2dPackedContext::Packed::Bias).toTensor()); + const auto packed_filter = + conv_context->get_val(Conv2dPackedContext::Packed::FilterSizes) + .toIntVector(); + const auto packed_stride = + conv_context->get_val(Conv2dPackedContext::Packed::Stride).toIntVector(); + const auto packed_padding = + conv_context->get_val(Conv2dPackedContext::Packed::Padding).toIntVector(); + const auto packed_output_padding = + conv_context->get_val(Conv2dPackedContext::Packed::OutputPadding) + .toIntVector(); + const auto packed_dilation = + conv_context->get_val(Conv2dPackedContext::Packed::Dilation) + .toIntVector(); + const auto quantized = + conv_context->get_val(Conv2dPackedContext::Packed::isQuantized).toBool(); + const float packed_output_min = safe_downcast( + conv_context->get_val(Conv2dPackedContext::Packed::OutputMin).toDouble()); + const float packed_output_max = safe_downcast( + conv_context->get_val(Conv2dPackedContext::Packed::OutputMax).toDouble()); + const Conv2dMethod method_ = + (Conv2dMethod)conv_context + ->get_val(Conv2dPackedContext::Packed::ConvMethod) + .toInt(); + const auto unpacked_filter = + conv_context->get_val(Conv2dPackedContext::Packed::WeightSizes) + .toIntVector(); + + TORCH_CHECK( + usable(input, quantized), + "Vulkan Convolution not usable! 
" + "Reason: The provided input tensor is either invalid or unsupported by Vulkan impl."); + + vTensor v_output{ + context, + conv_output_size( + v_input.sizes(), + unpacked_filter, + packed_padding, + packed_stride, + packed_dilation), + input.options(), + scale, + zero_point, + }; + + switch (method_) { + case QConv2dSlidingWindow: + conv2d_sliding_window_q( + VK_KERNEL(quantized_conv2d), + v_output, + v_input, + packed_v_weight, + packed_v_bias, + packed_filter, + packed_stride, + packed_padding, + packed_dilation, + packed_output_min, + packed_output_max, + unpacked_filter, + method_, + v_input.get_scale(), + v_input.get_zero_point()); + break; + case QConv2dPointwise: + conv2d_sliding_window_q( + VK_KERNEL(quantized_conv2d_pw_2x2), + v_output, + v_input, + packed_v_weight, + packed_v_bias, + packed_filter, + packed_stride, + packed_padding, + packed_dilation, + packed_output_min, + packed_output_max, + unpacked_filter, + method_, + v_input.get_scale(), + v_input.get_zero_point()); + break; + case QConv2dDepthwise: + conv2d_sliding_window_q( + VK_KERNEL(quantized_conv2d_dw), + v_output, + v_input, + packed_v_weight, + packed_v_bias, + packed_filter, + packed_stride, + packed_padding, + packed_dilation, + packed_output_min, + packed_output_max, + unpacked_filter, + method_, + v_input.get_scale(), + v_input.get_zero_point()); + break; + default: + TORCH_CHECK(false, "Invalid Method"); + } + + return convert_quantized(v_output); +} + +Tensor conv2d( + const Tensor& input, + const Tensor& weight, + const c10::optional& bias, + IntArrayRef stride, + IntArrayRef padding, + IntArrayRef dilation, + int64_t groups, + double out_scale, + int64_t out_zero_point) { + return quantized_convolution( + input, weight, bias, stride, padding, dilation, - /* transposed = */ false, - /* output_padding_arg = */ {}, + false, + {{0, 0}}, groups, - output_min, - output_max)); -} - -Tensor run_conv2d_clamp_context( - const Tensor& input, - const c10::intrusive_ptr& vulkan_context) { - return conv2d_context_run( - input, vulkan_context->get_packed(), vulkan_context->get_unpacked()); + out_scale, + out_zero_point); } /* Backwards compatibility */ -Conv2dOpContext::Conv2dOpContext(VulkanOpContext vulkan_context) - : vulkan_context_{std::move(vulkan_context)} {} +Conv2dOpContext::Conv2dOpContext(Conv2dPackedContext conv_context) + : conv_context_{std::move(conv_context)} {} Conv2dOpContext Conv2dOpContext::create( const Tensor& weight, @@ -638,13 +1379,14 @@ Conv2dOpContext Conv2dOpContext::create( const int64_t groups, const c10::optional& output_min, const c10::optional& output_max) { - return Conv2dOpContext{conv2d_context_create( + return Conv2dOpContext{Conv2dPackedContext( weight, bias, stride_arg, padding_arg, dilation_arg, transposed, + /* quantized = */ false, output_padding_arg, groups, output_min, @@ -652,36 +1394,24 @@ Conv2dOpContext Conv2dOpContext::create( } Tensor Conv2dOpContext::run(const Tensor& input_arg) const { - return conv2d_context_run( - input_arg, vulkan_context_.get_packed(), vulkan_context_.get_unpacked()); + return run_conv2d_context( + input_arg, c10::make_intrusive(conv_context_)); } Conv2dOpContext::State Conv2dOpContext::unpack() const { - const c10::impl::GenericList unpacked_ = - std::get<1>(vulkan_context_.get_state()); - const Tensor unpacked_weight = unpacked_.get(0).toTensor(); - const c10::optional unpacked_bias = unpacked_.get(1).isTensor() - ? 
unpacked_.get(1).toTensor() - : (c10::optional&)c10::nullopt; - const std::vector unpacked_stride = unpacked_.get(2).toIntVector(); - const std::vector unpacked_padding = unpacked_.get(3).toIntVector(); - const std::vector unpacked_dilation = unpacked_.get(4).toIntVector(); - const int64_t unpacked_groups = unpacked_.get(5).toInt(); - const c10::optional unpacked_output_min = unpacked_.get(6).isScalar() - ? unpacked_.get(6).toScalar() - : (c10::optional)c10::nullopt; - const c10::optional unpacked_output_max = unpacked_.get(6).isScalar() - ? unpacked_.get(7).toScalar() - : (c10::optional)c10::nullopt; - return Conv2dOpContext::State{ - unpacked_weight, - unpacked_bias, - unpacked_stride, - unpacked_padding, - unpacked_dilation, - unpacked_groups, - unpacked_output_min, - unpacked_output_max}; + const c10::impl::GenericList unpacked_ = conv_context_.unpack(); + + TORCH_CHECK(unpacked_.size() > 0u, "unpacked_ does not have any elements!"); + + return Conv2dOpContext::State( + unpacked_.get(Conv2dPackedContext::Unpacked::Weight).toTensor(), + get_optional_tensor(unpacked_, Conv2dPackedContext::Unpacked::Bias), + unpacked_.get(Conv2dPackedContext::Unpacked::Stride).toIntVector(), + unpacked_.get(Conv2dPackedContext::Unpacked::Padding).toIntVector(), + unpacked_.get(Conv2dPackedContext::Unpacked::Dilation).toIntVector(), + unpacked_.get(Conv2dPackedContext::Unpacked::Groups).toInt(), + get_optional_scalar(unpacked_, Conv2dPackedContext::Unpacked::OutputMin), + get_optional_scalar(unpacked_, Conv2dPackedContext::Unpacked::OutputMax)); } c10::intrusive_ptr conv2d_clamp_prepack( @@ -700,7 +1430,7 @@ c10::intrusive_ptr conv2d_clamp_prepack( std::move(padding), std::move(dilation), /* transposed = */ false, - /* output_padding = */ {}, + /* output_padding = */ {0}, groups, output_min, output_max)); @@ -712,6 +1442,10 @@ Tensor conv2d_clamp_run( return context->run(input); } +TORCH_LIBRARY_IMPL(aten, Vulkan, m) { + m.impl("convolution_overrideable", convolution); +} + } // namespace ops } // namespace vulkan } // namespace native diff --git a/aten/src/ATen/native/vulkan/ops/Convolution.h b/aten/src/ATen/native/vulkan/ops/Convolution.h index 69680a4b167b4..745d6064def17 100644 --- a/aten/src/ATen/native/vulkan/ops/Convolution.h +++ b/aten/src/ATen/native/vulkan/ops/Convolution.h @@ -3,7 +3,7 @@ #ifdef USE_VULKAN_API #include -#include +#include namespace at { namespace native { @@ -14,61 +14,125 @@ enum Conv2dMethod { Conv2dDepthwise, Conv2dPointwise, Conv2dSlidingWindow, + TConv2dSlidingWindow, + QConv2dDepthwise, + QConv2dPointwise, + QConv2dSlidingWindow, }; -// private: -// packed -// vTensor v_weight -// vTensor v_bias -// std::array filter -// std::array stride -// std::array padding -// std::array dilation -// int32_t groups -// float output_min -// float output_max - -// unpacked -// Tensor weight -// c10::optional bias -// std::vector filter -// std::vector stride -// std::vector padding -// std::vector dilation -// int64_t groups -// c10::optional output_min -// c10::optional output_max - -VulkanOpContext conv2d_context_create( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride_arg, - const IntArrayRef padding_arg, - const IntArrayRef dilation_arg, - const bool transposed, - const IntArrayRef output_padding_arg, +class Conv2dPackedContext final : virtual public VulkanPackedContext, + public torch::jit::CustomClassHolder { + private: + c10::impl::GenericList unpacked_; + + public: + Conv2dPackedContext( + const Tensor& weight, + const c10::optional& bias, + const 
IntArrayRef stride_arg, + const IntArrayRef padding_arg, + const IntArrayRef dilation_arg, + const bool transposed, + const bool quantized, + const IntArrayRef output_padding_arg, + const int64_t groups, + const c10::optional& output_min = c10::nullopt, + const c10::optional& output_max = c10::nullopt); + + /* + * Assigns a name to each index in the unpacked list. + */ + struct Unpacked final { + static constexpr uint32_t Weight = 0u; + static constexpr uint32_t Bias = 1u; + static constexpr uint32_t Stride = 2u; + static constexpr uint32_t Padding = 3u; + static constexpr uint32_t Dilation = 4u; + static constexpr uint32_t isTransposed = 5u; + static constexpr uint32_t isQuantized = 6u; + static constexpr uint32_t OutputPadding = 7u; + static constexpr uint32_t Groups = 8u; + static constexpr uint32_t OutputMin = 9u; + static constexpr uint32_t OutputMax = 10u; + + static constexpr uint32_t NumArgs = 11u; + }; + + /* + * Assigns a name to each index in the packed list. + */ + struct Packed final { + static constexpr uint32_t Weight = 0u; + static constexpr uint32_t Bias = 1u; + static constexpr uint32_t FilterSizes = 2u; + static constexpr uint32_t Stride = 3u; + static constexpr uint32_t Padding = 4u; + static constexpr uint32_t OutputPadding = 5u; + static constexpr uint32_t Dilation = 6u; + static constexpr uint32_t isTransposed = 7u; + static constexpr uint32_t isQuantized = 8u; + static constexpr uint32_t Groups = 9u; + static constexpr uint32_t OutputMin = 10u; + static constexpr uint32_t OutputMax = 11u; + static constexpr uint32_t ConvMethod = 12u; + static constexpr uint32_t WeightSizes = 13u; + + static constexpr uint32_t NumArgs = 14u; + }; + + static Conv2dPackedContext pack(c10::impl::GenericList); + + const c10::impl::GenericList unpack() const override { + TORCH_CHECK(unpacked_.size() > 0u, "unpacked_ does not have any elements!"); + + return unpacked_; + } +}; + +c10::intrusive_ptr create_conv2d_context( + Tensor&& weight, + c10::optional&& bias, + std::vector&& stride, + std::vector&& padding, + std::vector&& dilation, const int64_t groups, const c10::optional& output_min = c10::nullopt, const c10::optional& output_max = c10::nullopt); -Tensor conv2d_context_run( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context); +Tensor run_conv2d_context( + const Tensor& input, + const c10::intrusive_ptr& context); + +c10::intrusive_ptr create_tconv2d_context( + Tensor&& weight, + c10::optional&& bias, + std::vector&& stride, + std::vector&& padding, + std::vector&& output_padding, + std::vector&& dilation, + const int64_t groups, + const c10::optional& output_min = c10::nullopt, + const c10::optional& output_max = c10::nullopt); -Tensor run_conv2d_clamp_context( +Tensor run_tconv2d_context( const Tensor& input, - const c10::intrusive_ptr& context); + const c10::intrusive_ptr& context); -c10::intrusive_ptr create_conv2d_clamp_context( +c10::intrusive_ptr create_qconv2d_context( Tensor&& weight, c10::optional&& bias, std::vector&& stride, std::vector&& padding, std::vector&& dilation, const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max); + const c10::optional& output_min = c10::nullopt, + const c10::optional& output_max = c10::nullopt); + +Tensor run_qconv2d_context( + const Tensor& input_arg, + double scale, + int64_t zero_point, + const c10::intrusive_ptr& conv_context); // Backwards compatibility class Conv2dOpContext final : public torch::jit::CustomClassHolder { @@ -99,8 
+163,8 @@ class Conv2dOpContext final : public torch::jit::CustomClassHolder { State unpack() const; private: - explicit Conv2dOpContext(VulkanOpContext vulkan_context); - VulkanOpContext vulkan_context_; + explicit Conv2dOpContext(Conv2dPackedContext conv_context); + Conv2dPackedContext conv_context_; }; Tensor conv2d_clamp_run( diff --git a/aten/src/ATen/native/vulkan/ops/Copy.cpp b/aten/src/ATen/native/vulkan/ops/Copy.cpp index fb4db712a8ad9..dbac25e0c7ee3 100644 --- a/aten/src/ATen/native/vulkan/ops/Copy.cpp +++ b/aten/src/ATen/native/vulkan/ops/Copy.cpp @@ -1,4 +1,4 @@ -#include +#include #include namespace at { @@ -6,12 +6,112 @@ namespace native { namespace vulkan { namespace ops { -void copy_vulkan_to_vulkan(vTensor& src, vTensor& dst) { +// +// Utility functions for memcpy +// + +void memcpy_to_mapping(const Tensor& src, api::MemoryMap& dst_mapping) { + if (src.dtype() == at::kFloat) { + memcpy_to_mapping_impl(src, dst_mapping); + } else if (src.dtype() == at::kHalf) { + memcpy_to_mapping_impl(src, dst_mapping); + } else if (src.dtype() == c10::kQUInt8) { + memcpy_to_mapping_impl(src, dst_mapping); + } else { + TORCH_CHECK( + false, + "Invalid Data Type: expected c10::QUint8, at::kHalf or at::Float but got ", + src.dtype()); + } +} + +void memcpy_from_mapping(api::MemoryMap& src_mapping, Tensor& dst) { + if (dst.dtype() == at::kFloat) { + memcpy_from_mapping_impl(src_mapping, dst); + } else if (dst.dtype() == at::kHalf) { + memcpy_from_mapping_impl(src_mapping, dst); + } else if (dst.dtype() == c10::kQUInt8) { + memcpy_from_mapping_impl(src_mapping, dst); + } else { + TORCH_CHECK( + false, + "Invalid Data Type: expected c10::QUint8, at::kHalf or Float but got ", + dst.dtype()); + } +} + +// +// CPU <-> GPU copy implementations (these functions use Transfer commands) +// + +void transfer_cpu_to_vulkan(const Tensor& src, vTensor& v_dst) { + api::Context* const context = api::context(); + + // Convert to dtype corresponding to the image format of the texture to + // ensure that byte alignment is consistent when copying. In some cases + // a 16 bit format will be used for at::kFloat. + Tensor src_nc4hw = utils::nchw_to_nc4hw(src).to(v_dst.texture_dtype()); + + api::StorageBuffer staging(context, v_dst.texture_dtype(), v_dst.numcells()); + // Copy data into the staging buffer + { + api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); + mapping.invalidate(); + + memcpy_to_mapping(src_nc4hw, mapping); + } + + api::PipelineBarrier pipeline_barrier{}; + utils::copy_buffer_to_vtensor(staging.buffer(), v_dst, pipeline_barrier); +} + +void transfer_vulkan_to_cpu(vTensor& v_src, Tensor& dst) { + api::Context* const context = api::context(); + + // Temporary tensor to receive copied NC4HW data + at::Tensor dst_tmp = utils::create_staging_tensor(v_src); + + api::StorageBuffer staging(context, v_src.texture_dtype(), v_src.numcells()); + + api::VulkanFence fence = context->fences().get_fence(); + + { + // Refer to comment in submit_compute_job. When syncing with the GPU, the + // context must not allow other threads to record dispatches into it between + // between calling vkQueueSubmit and flushing the context. Therefore, + // cmd_mutex_ must be manually managed by the calling thread. 
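+    // Illustrative outline (added for exposition, not part of this patch):
+    // the block below takes the dispatch lock, records and submits the
+    // texture-to-buffer copy with a fence, waits on that fence, and only then
+    // flushes the context, so the staging buffer is mapped for reading after
+    // the GPU has finished writing it.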
+ std::unique_lock context_lock(context->dispatch_lock()); + + api::PipelineBarrier pipeline_barrier{}; + utils::copy_vtensor_to_buffer( + v_src, staging.buffer(), pipeline_barrier, fence.get_submit_handle()); + + fence.wait(); + + context->flush(); + // cmd_mutex_ will be released when exiting this scope. + } + + // Copy data from buffer back to CPU tensor. + { + api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::READ); + mapping.invalidate(); + + memcpy_from_mapping(mapping, dst_tmp); + } + + context->fences().return_fence(fence); + + dst = + utils::nc4hw_to_nchw(dst_tmp, v_src.sizes()).to(v_src.options().dtype()); +} + +void transfer_vulkan_to_vulkan(vTensor& src, vTensor& dst) { api::Context* const context = api::context(); api::PipelineBarrier pipeline_barrier{}; - context->submit_texture_copy( + context->submit_copy( // pipeline barrier pipeline_barrier, // images @@ -28,34 +128,42 @@ void copy_vulkan_to_vulkan(vTensor& src, vTensor& dst) { VK_NULL_HANDLE); } -void copy_cpu_to_vulkan(const Tensor& src, vTensor& dst) { +// +// CPU <-> GPU copy implementations (these functions use compute shaders) +// + +void pack_cpu_to_vulkan(const Tensor& src, vTensor& dst) { api::Context* const context = api::context(); - api::StagingBuffer staging(context, dst.buffer_bytes()); + // Note that the float data type has been enforced for the storage buffer + // below. The reason for this is that the nchw_to_image and image_to_nchw + // shaders which perform the transfer to/from an image texture expect a buffer + // of floats as input. GLSL/Vulkan does not natively support 16 bit arithmetic + // types, so for now storage buffers created for compute shaders must define + // floats as their base data type. + api::StorageBuffer staging(context, at::kFloat, dst.numcells()); { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); - if (src.dtype() == c10::kQUInt8) { - c10::quint8* data_ptr = mapping.template data(); - memcpy( - data_ptr, - src.contiguous().data_ptr(), - std::min(src.nbytes(), src.nbytes())); + // If the dtype() of src is at::kHalf, then first convert it to 32 bit + // float. This is required since the nchw_to_image shader uses a float + // buffer as input (note that at::kFloat is used to create the StorageBuffer + // above). + if (src.dtype() == at::kHalf) { + memcpy_to_mapping(src.to(at::kFloat), mapping); } else { - float* data_ptr = mapping.template data(); - memcpy( - data_ptr, - src.contiguous().data_ptr(), - std::min(src.nbytes(), src.nbytes())); + memcpy_to_mapping(src, mapping); } } utils::pack_staging_to_vtensor(staging.buffer(), dst); } -void copy_vulkan_to_cpu(vTensor& src, Tensor& dst) { +void pack_vulkan_to_cpu(vTensor& src, Tensor& dst) { api::Context* const context = api::context(); - api::StagingBuffer staging(context, src.buffer_bytes()); + // Refer to the comment in pack_cpu_to_vulkan for why at::kFloat is specified + // for the storage buffer below. + api::StorageBuffer staging(context, at::kFloat, src.numcells()); api::VulkanFence fence = context->fences().get_fence(); @@ -80,46 +188,41 @@ void copy_vulkan_to_cpu(vTensor& src, Tensor& dst) { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::READ); mapping.invalidate(); - if (dst.is_quantized()) { - c10::quint8* data_ptr = mapping.template data(); - memcpy( - dst.data_ptr(), - data_ptr, - std::min(src.nbytes(), dst.nbytes())); + // If the dtype() of dst is at::kHalf, then copy the data into a float + // version of it first, similar to pack_cpu_to_vulkan(). 
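+    // Illustrative note (added for exposition, not part of this patch):
+    // because the staging buffer above is created with at::kFloat, a dst of
+    // dtype at::kHalf is filled via a temporary float tensor (the float
+    // overload of memcpy_from_mapping) and then narrowed back to 16-bit with
+    // .to(at::kHalf), rather than copying half-precision bytes directly.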
+ if (dst.dtype() == at::kHalf) { + Tensor dst_float = dst.to(at::kFloat); + memcpy_from_mapping(mapping, dst_float); + dst = dst_float.to(at::kHalf); } else { - float* data_ptr = mapping.template data(); - memcpy( - dst.data_ptr(), - data_ptr, - std::min(src.nbytes(), dst.nbytes())); + memcpy_from_mapping(mapping, dst); } } context->fences().return_fence(fence); } -Tensor& copy_(Tensor& self, const Tensor& src) { +// +// Copy op implementations +// + +Tensor& copy_(Tensor& dst, const Tensor& src) { // Check that sizes are equal TORCH_CHECK( - self.sizes() == src.sizes(), - "Vulkan copy_: Tensor sizes are mismatched!"); + dst.sizes() == src.sizes(), "Vulkan copy_: Tensor sizes are mismatched!"); // X -> Vulkan - if (at::kVulkan == self.device().type()) { - vTensor& v_self = convert(self); + if (at::kVulkan == dst.device().type()) { + vTensor& v_self = convert(dst); // Vulkan -> Vulkan if (at::kVulkan == src.device().type()) { vTensor& v_src = convert(src); - copy_vulkan_to_vulkan(v_src, v_self); + transfer_vulkan_to_vulkan(v_src, v_self); } // CPU -> Vulkan else { - TORCH_CHECK( - src.dtype() == c10::kQUInt8 || src.dtype() == at::kFloat, - "Invalid Data Type: expected QUint8 or Float but got ", - src.dtype()); - copy_cpu_to_vulkan(src, v_self); + pack_cpu_to_vulkan(src, v_self); } } // Vulkan -> X @@ -127,12 +230,8 @@ Tensor& copy_(Tensor& self, const Tensor& src) { vTensor& v_src = convert(src); // Vulkan -> CPU - if (self.device().is_cpu()) { - TORCH_CHECK( - self.dtype() == c10::kQUInt8 || self.dtype() == at::kFloat, - "Invalid Data Type: expected QUint8 or Float but got ", - self.dtype()); - copy_vulkan_to_cpu(v_src, self); + if (dst.device().is_cpu()) { + pack_vulkan_to_cpu(v_src, dst); } else { TORCH_CHECK(false, "Unsupported!"); } @@ -143,7 +242,7 @@ Tensor& copy_(Tensor& self, const Tensor& src) { "was expected to be Vulkan a tensor! 
Incorrect dispatch?"); } - return self; + return dst; } } // namespace ops diff --git a/aten/src/ATen/native/vulkan/ops/Copy.h b/aten/src/ATen/native/vulkan/ops/Copy.h index e69af06357c5a..1493af6e629bd 100644 --- a/aten/src/ATen/native/vulkan/ops/Copy.h +++ b/aten/src/ATen/native/vulkan/ops/Copy.h @@ -9,7 +9,37 @@ namespace native { namespace vulkan { namespace ops { -Tensor& copy_(Tensor& self, const Tensor& src); +void transfer_cpu_to_vulkan(const Tensor&, vTensor&); + +void transfer_vulkan_to_cpu(vTensor&, Tensor&); + +Tensor& copy_(Tensor& dst, const Tensor& src); + +// +// Utility functions for memcpy +// + +template +void memcpy_to_mapping_impl(const Tensor& src, api::MemoryMap& dst_mapping) { + T* data_ptr = dst_mapping.template data(); + memcpy( + data_ptr, + src.contiguous().data_ptr(), + std::min(src.nbytes(), dst_mapping.nbytes())); +} + +template +void memcpy_from_mapping_impl(api::MemoryMap& src_mapping, Tensor& dst) { + T* data_ptr = src_mapping.template data(); + memcpy( + dst.data_ptr(), + data_ptr, + std::min(src_mapping.nbytes(), dst.nbytes())); +} + +void memcpy_to_mapping(const Tensor& src, api::MemoryMap& dst_mapping); + +void memcpy_from_mapping(api::MemoryMap& src_mapping, Tensor& dst); } // namespace ops } // namespace vulkan diff --git a/aten/src/ATen/native/vulkan/ops/Glu.cpp b/aten/src/ATen/native/vulkan/ops/Glu.cpp index 1778813bce57b..1a1f58b6dce5d 100644 --- a/aten/src/ATen/native/vulkan/ops/Glu.cpp +++ b/aten/src/ATen/native/vulkan/ops/Glu.cpp @@ -16,7 +16,7 @@ Tensor glu(const at::Tensor& input_arg, const int64_t dim = -1) { "Vulkan glu only supports GLU for dim = 1, but got dim = ", dim); TORCH_CHECK( - channels_size(input_arg) % 2 == 0, + get_dim(input_arg) % 2 == 0, "Vulkan glu expects channel dim to be multiple of 2!"); const Tensor input = input_arg.is_vulkan() ? input_arg : input_arg.vulkan(); diff --git a/aten/src/ATen/native/vulkan/ops/Gru.cpp b/aten/src/ATen/native/vulkan/ops/Gru.cpp index e29c6b59fd9fb..9be247499d416 100644 --- a/aten/src/ATen/native/vulkan/ops/Gru.cpp +++ b/aten/src/ATen/native/vulkan/ops/Gru.cpp @@ -1,6 +1,5 @@ #include #include -#include #include namespace at { @@ -9,14 +8,19 @@ namespace vulkan { namespace ops { namespace { // -// input_vk: input tensor of shape (L, N, H_in) when batch_first=False -// (N, L, H_in) when batch_first=True containing -// the features of the input sequence -// hx_vk: initial hidden state for each element in the batch. tensor of shape (D -// * num_layers, N, H_out) output: tensor of shape (N, L, D * H_out)) when -// batch_first=True h_n: tensor of shape (D * num_layers, N, H_out) +// input_vk: input tensor containing the features of the input sequence +// tensor of shape (N, L, H_in) when batch_first=True +// (L, N, H_in) when batch_first=False // -// where +// hx_vk: initial hidden state for each element in the batch. 
+// tensor of shape (D * num_layers, N, H_out) +// +// output: tensor of shape (N, L, D * H_out) when batch_first=True +// (L, N, D * H_out) when batch_first=False +// +// h_n: tensor of shape (D * num_layers, N, H_out) +// +// where // L = sequence length // N = batch size // D = 2 if bidirectional=True otherwise 1 @@ -46,18 +50,22 @@ std::tuple gru_input( TORCH_INTERNAL_ASSERT(!train, "Vulkan gru expects 'train' to be false."); TORCH_INTERNAL_ASSERT( !bidirectional, "Vulkan gru expects 'bidirectional' to be false."); - TORCH_INTERNAL_ASSERT( - batch_first, "Vulkan gru expects 'batch_first' to be true."); TORCH_INTERNAL_ASSERT( dropout < std::numeric_limits::epsilon() * 1000, "Vulkan gru expects 'dropout' to be 0.0."); + const auto batch_size = input_vk.size(0); + const auto seq_length = input_vk.size(1); + + TORCH_INTERNAL_ASSERT( + (batch_size == 1 && seq_length == 1) || batch_first, + "Vulkan gru expects batch-first input"); + const auto hidden_size = hx_vk.size(2); std::vector h_n_list; // hidden output // reshape to 2D due to Vulkan at::mm op accepts only 2D - auto x = - input_vk.reshape({input_vk.size(0) * input_vk.size(1), input_vk.size(2)}); + auto x = input_vk.reshape({batch_size * seq_length, input_vk.size(2)}); for (int64_t i = 0; i < num_layers; ++i) { // extract each hidden state and squeeze into 2D dim @@ -100,6 +108,7 @@ std::tuple gru_input( } auto h_n = at::cat(h_n_list, 1); + x = x.reshape({batch_size, seq_length, x.size(1)}); h_n = h_n.reshape({h_n.size(0) * h_n.size(1), h_n.size(2), h_n.size(3)}); return std::tuple(x, h_n); } @@ -114,13 +123,19 @@ TORCH_LIBRARY_IMPL(aten, Vulkan, m) { } // namespace -std::vector> pack_linear_op_contexts( +std::vector> pack_linear_op_contexts( const std::vector& params_cpu, int64_t num_layers) { TORCH_CHECK( static_cast(params_cpu.size()) == 4 * num_layers, - "Vulkan gru expects 'params_cpu' size to be 4 * 'num_layers'."); - std::vector> linear_op_contexts; + "Vulkan gru expects 'params_cpu' size to be 4 * 'num_layers'." 
+ " But 'params_cpu' has size: ", + params_cpu.size(), + " and 'num_layers' is: ", + num_layers); + std::vector> linear_op_contexts; + linear_op_contexts.reserve(num_layers * 6); + for (int64_t i = 0; i < num_layers; ++i) { const auto& w_ih = params_cpu.at(i * 4); const auto& w_hh = params_cpu.at(i * 4 + 1); @@ -156,7 +171,7 @@ std::vector> pack_linear_op_contexts( return linear_op_contexts; } -VulkanOpContext gru_context_create( +GruPackedContext::GruPackedContext( const std::vector& params_cpu, // weights/biases (cpu) bool has_biases, int64_t num_layers, @@ -169,99 +184,151 @@ VulkanOpContext gru_context_create( TORCH_INTERNAL_ASSERT(!train, "Vulkan gru expects 'train' to be false."); TORCH_INTERNAL_ASSERT( !bidirectional, "Vulkan gru expects 'bidirectional' to be false."); - TORCH_INTERNAL_ASSERT( - batch_first, "Vulkan gru expects 'batch_first' to be true."); TORCH_INTERNAL_ASSERT( dropout < std::numeric_limits::epsilon() * 1000, "Vulkan gru expects 'dropout' to be 0.0."); - c10::impl::GenericList packed_context{c10::AnyType::get()}; - packed_context.reserve(7); - packed_context.emplace_back(pack_linear_op_contexts(params_cpu, num_layers)); - packed_context.emplace_back(has_biases); - packed_context.emplace_back(num_layers); - packed_context.emplace_back(dropout); - packed_context.emplace_back(train); - packed_context.emplace_back(bidirectional); - packed_context.emplace_back(batch_first); - - c10::impl::GenericList unpacked_context{c10::AnyType::get()}; - unpacked_context.reserve(7); - unpacked_context.emplace_back(params_cpu); - unpacked_context.emplace_back(has_biases); - unpacked_context.emplace_back(num_layers); - unpacked_context.emplace_back(dropout); - unpacked_context.emplace_back(train); - unpacked_context.emplace_back(bidirectional); - unpacked_context.emplace_back(batch_first); - - return VulkanOpContext::create(packed_context, unpacked_context); + packed_.reserve(Packed::NumArgs); + packed_.emplace_back(pack_linear_op_contexts(params_cpu, num_layers)); + packed_.emplace_back(has_biases); + packed_.emplace_back(num_layers); + packed_.emplace_back(dropout); + packed_.emplace_back(train); + packed_.emplace_back(bidirectional); + packed_.emplace_back(batch_first); +} + +GruPackedContext GruPackedContext::pack(c10::impl::GenericList unpacked) { + return GruPackedContext( + unpacked.get(Unpacked::Params).toTensorVector(), + unpacked.get(Unpacked::hasBiases).toBool(), + unpacked.get(Unpacked::NumLayers).toInt(), + unpacked.get(Unpacked::Dropout).toDouble(), + unpacked.get(Unpacked::Train).toBool(), + unpacked.get(Unpacked::Bidirectional).toBool(), + unpacked.get(Unpacked::BatchFirst).toBool()); +} + +const c10::impl::GenericList GruPackedContext::unpack() const { + c10::impl::GenericList unpacked_gru_context{c10::AnyType::get()}; + unpacked_gru_context.reserve(Unpacked::NumArgs); + + const c10::List packed_linear_contexts = + get_val(Packed::LinearContexts).toList(); + + const int64_t num_layers = get_val(Packed::NumLayers).toInt(); + const int64_t linear_contexts_per_layer = 6; + + std::vector params_cpu; + params_cpu.reserve(num_layers * linear_contexts_per_layer); + + for (c10::IValue packed_linear_context : packed_linear_contexts) { + const c10::impl::GenericList unpacked_linear_context = + packed_linear_context.toCustomClass()->unpack(); + + TORCH_CHECK( + unpacked_linear_context.size() > 0u, + "unpacked_linear_context does not have any elements!"); + + params_cpu.emplace_back( + unpacked_linear_context.get(LinearPackedContext::Unpacked::Weight) + .toTensor() + .t()); + 
params_cpu.emplace_back( + unpacked_linear_context.get(LinearPackedContext::Unpacked::Bias) + .toTensor()); + } + unpacked_gru_context.emplace_back(params_cpu); + for (int64_t i = 1; i < Unpacked::NumArgs; ++i) { + unpacked_gru_context.emplace_back(get_val(i)); + } + + return unpacked_gru_context; } -std::tuple gru_context_run( +c10::intrusive_ptr create_gru_context( + std::vector&& params_cpu, + bool has_biases, + int64_t num_layers, + double dropout, + bool train, + bool bidirectional, + bool batch_first) { + return c10::make_intrusive(GruPackedContext( + params_cpu, + has_biases, + num_layers, + dropout, + train, + bidirectional, + batch_first)); +} + +std::tuple run_gru_context( const Tensor& input_vk, // input sequence (vulkan) const Tensor& hx_vk, // initial hidden state (vulkan) - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context) { + const c10::intrusive_ptr& gru_context) { TORCH_INTERNAL_ASSERT( input_vk.sizes().size() == 3, "Vulkan gru expects 'input_vk' dims to be 3."); TORCH_INTERNAL_ASSERT( hx_vk.sizes().size() == 3, "Vulkan gru expects 'hx_vk' dims to be 3."); - const c10::List packed_linear_op_contexts = - packed_context.get(0).toList(); - const int64_t packed_num_layers = packed_context.get(2).toInt(); + const int64_t num_layers = + gru_context->get_val(GruPackedContext::Packed::NumLayers).toInt(); + const bool batch_first = + gru_context->get_val(GruPackedContext::Packed::BatchFirst).toBool(); + const auto batch_size = input_vk.size(0); + const auto seq_length = input_vk.size(1); + + TORCH_INTERNAL_ASSERT( + (batch_size == 1 && seq_length == 1) || batch_first, + "Vulkan gru expects batch-first input"); + + const c10::List packed_linear_contexts = + gru_context->get_val(GruPackedContext::Packed::LinearContexts).toList(); - const int64_t linear_op_contexts_per_layer = - 6; // (b_ir, w_ir), (b_hr, w_hr), (b_iz, w_iz), (b_hz, w_hz), (b_in, - // w_in), (b_hn, w_hn) + const int64_t linear_contexts_per_layer = 6; + // (b_ir, w_ir), (b_hr, w_hr), (b_iz, w_iz), + // (b_hz, w_hz), (b_in,cw_in), (b_hn, w_hn) std::vector h_n_list; // hidden output // reshape to 2D due to Vulkan at::mm op accepts only 2D - auto x = - input_vk.reshape({input_vk.size(0) * input_vk.size(1), input_vk.size(2)}); + auto x = input_vk.reshape({batch_size * seq_length, input_vk.size(2)}); - for (int64_t i = 0; i < packed_num_layers; ++i) { + for (int64_t i = 0; i < num_layers; ++i) { // extract each hidden state and squeeze into 2D dim auto h = at::slice(hx_vk, 0, i, i + 1, 1); h = h.reshape({h.size(0) * h.size(1), h.size(2)}); const auto& cxt_ir = - packed_linear_op_contexts[i * linear_op_contexts_per_layer + 0] - .toCustomClass(); + packed_linear_contexts[i * linear_contexts_per_layer + 0] + .toCustomClass(); const auto& cxt_hr = - packed_linear_op_contexts[i * linear_op_contexts_per_layer + 1] - .toCustomClass(); + packed_linear_contexts[i * linear_contexts_per_layer + 1] + .toCustomClass(); const auto& cxt_iz = - packed_linear_op_contexts[i * linear_op_contexts_per_layer + 2] - .toCustomClass(); + packed_linear_contexts[i * linear_contexts_per_layer + 2] + .toCustomClass(); const auto& cxt_hz = - packed_linear_op_contexts[i * linear_op_contexts_per_layer + 3] - .toCustomClass(); + packed_linear_contexts[i * linear_contexts_per_layer + 3] + .toCustomClass(); const auto& cxt_in = - packed_linear_op_contexts[i * linear_op_contexts_per_layer + 4] - .toCustomClass(); + packed_linear_contexts[i * linear_contexts_per_layer + 4] + .toCustomClass(); const auto& cxt_hn = - 
packed_linear_op_contexts[i * linear_op_contexts_per_layer + 5] - .toCustomClass(); + packed_linear_contexts[i * linear_contexts_per_layer + 5] + .toCustomClass(); const auto& r = at::sigmoid( - linear_context_run( - x, cxt_ir->get_packed(), cxt_ir->get_unpacked(), 1.0f, 1.0f) + - linear_context_run( - h, cxt_hr->get_packed(), cxt_hr->get_unpacked(), 1.0f, 1.0f)); + run_linear_context(x, cxt_ir) + run_linear_context(h, cxt_hr)); + // cxt_ir->run(x, 1.0f, 1.0f) + cxt_hr->run(h, 1.0f, 1.0f)); const auto& z = at::sigmoid( - linear_context_run( - x, cxt_iz->get_packed(), cxt_iz->get_unpacked(), 1.0f, 1.0f) + - linear_context_run( - h, cxt_hz->get_packed(), cxt_hz->get_unpacked(), 1.0f, 1.0f)); + run_linear_context(x, cxt_iz) + run_linear_context(h, cxt_hz)); + // cxt_iz->run(x, 1.0f, 1.0f) + cxt_hz->run(h, 1.0f, 1.0f)); const auto& n = at::tanh( - linear_context_run( - x, cxt_in->get_packed(), cxt_in->get_unpacked(), 1.0f, 1.0f) + - r * - (linear_context_run( - h, cxt_hn->get_packed(), cxt_hn->get_unpacked(), 1.0f, 1.0f))); + run_linear_context(x, cxt_in) + r * run_linear_context(h, cxt_hn)); + // cxt_in->run(x, 1.0f, 1.0f) + r * (cxt_hn->run(h, 1.0f, 1.0f))); h = (z * (-1) + 1) * n + z * h; x = h; // next input h_n_list.emplace_back( @@ -269,118 +336,11 @@ std::tuple gru_context_run( } auto h_n = at::cat(h_n_list, 1); + x = x.reshape({batch_size, seq_length, x.size(1)}); h_n = h_n.reshape({h_n.size(0) * h_n.size(1), h_n.size(2), h_n.size(3)}); return std::tuple(x, h_n); } -c10::intrusive_ptr create_gru_context( - std::vector&& params_cpu, - bool has_biases, - int64_t num_layers, - double dropout, - bool train, - bool bidirectional, - bool batch_first) { - return c10::make_intrusive(gru_context_create( - params_cpu, - has_biases, - num_layers, - dropout, - train, - bidirectional, - batch_first)); -} - -std::tuple run_gru_context( - const Tensor& input_vk, - const Tensor& hx_vk, - const c10::intrusive_ptr& vulkan_context) { - return gru_context_run( - input_vk, - hx_vk, - vulkan_context->get_packed(), - vulkan_context->get_unpacked()); -} - -/* Backwards compatibility */ -GruOpContext::GruOpContext(VulkanOpContext vulkan_context) - : vulkan_context_{std::move(vulkan_context)} {} - -GruOpContext GruOpContext::create( - const std::vector& params_cpu, // weights/biases (cpu) - bool has_biases, - int64_t num_layers, - double dropout, - bool train, - bool bidirectional, - bool batch_first) { - return GruOpContext{gru_context_create( - params_cpu, - has_biases, - num_layers, - dropout, - train, - bidirectional, - batch_first)}; -} - -std::tuple GruOpContext::run( - const Tensor& input_vk, // input sequence (vulkan) - const Tensor& hx_vk) const { // initial hidden state (vulkan) - return gru_context_run( - input_vk, - hx_vk, - vulkan_context_.get_packed(), - vulkan_context_.get_unpacked()); -} - -GruOpContext::State GruOpContext::unpack() const { - const c10::impl::GenericList unpacked_ = - std::get<1>(vulkan_context_.get_state()); - const std::vector unpacked_params_cpu = - unpacked_.get(0).toTensorVector(); - const bool unpacked_has_biases = unpacked_.get(1).toBool(); - const int64_t unpacked_num_layers = unpacked_.get(2).toInt(); - const double unpacked_dropout = unpacked_.get(3).toDouble(); - const bool unpacked_train = unpacked_.get(4).toBool(); - const bool unpacked_bidirectional = unpacked_.get(5).toBool(); - const bool unpacked_batch_first = unpacked_.get(6).toBool(); - return GruOpContext::State{ - unpacked_params_cpu, - unpacked_has_biases, - unpacked_num_layers, - unpacked_dropout, - 
unpacked_train, - unpacked_bidirectional, - unpacked_batch_first, - }; -} - -c10::intrusive_ptr gru_prepack( - std::vector&& params_cpu, - bool has_biases, - int64_t num_layers, - double dropout, - bool train, - bool bidirectional, - bool batch_first) { - return c10::make_intrusive(GruOpContext::create( - params_cpu, - has_biases, - num_layers, - dropout, - train, - bidirectional, - batch_first)); -} - -std::tuple gru_run( - const Tensor& input_vk, - const Tensor& hx_vk, - const c10::intrusive_ptr& context) { - return context->run(input_vk, hx_vk); -} - } // namespace ops } // namespace vulkan } // namespace native diff --git a/aten/src/ATen/native/vulkan/ops/Gru.h b/aten/src/ATen/native/vulkan/ops/Gru.h index 304ce822a0e9a..922ac02fc2d09 100644 --- a/aten/src/ATen/native/vulkan/ops/Gru.h +++ b/aten/src/ATen/native/vulkan/ops/Gru.h @@ -3,7 +3,7 @@ #ifdef USE_VULKAN_API #include -#include +#include #include namespace at { @@ -11,81 +11,10 @@ namespace native { namespace vulkan { namespace ops { -// packed -// std::vector> linear_op_contexts; // -// {{ op context for b_ir, w_ir, op context for b_hr, w_hr, -// // -// op -// context -// for -// b_iz, -// w_iz, -// op -// context -// for -// b_hz, -// w_hz, -// // -// op -// context -// for -// b_in, -// w_in, -// op -// context -// for -// b_hn, -// w_hn,}, -// ...} -// bool has_biases{}; -// int64_t num_layers{}; -// double dropout{}; -// bool train{}; -// bool bidirectional{}; -// bool batch_first{}; - -// unpacked -// std::vector params_cpu // weights/biases (cpu) -// bool has_biases -// int64_t num_layers -// double dropout -// bool train -// bool bidirectional -// bool batch_first - -VulkanOpContext gru_context_create( - const std::vector& params_cpu, // weights/biases (cpu) - bool has_biases, - int64_t num_layers, - double dropout, - bool train, - bool bidirectional, - bool batch_first); - -std::tuple gru_context_run( - const Tensor& input_vk, // input sequence (vulkan) - const Tensor& hx_vk, // initial hidden state (vulkan) - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context); - -c10::intrusive_ptr create_gru_context( - std::vector&& params_cpu, // weights/biases (cpu) - bool has_biases, - int64_t num_layers, - double dropout, - bool train, - bool bidirectional, - bool batch_first); - -std::tuple run_gru_context( - const Tensor& input_vk, - const Tensor& hx_vk, - const c10::intrusive_ptr& vulkan_context); - -// Backwards compatibility -class GruOpContext final : public torch::jit::CustomClassHolder { +class GruPackedContext final : virtual public VulkanPackedContext, + public torch::jit::CustomClassHolder { public: - static GruOpContext create( + GruPackedContext( const std::vector& params_cpu, // weights/biases (cpu) bool has_biases, int64_t num_layers, @@ -94,19 +23,42 @@ class GruOpContext final : public torch::jit::CustomClassHolder { bool bidirectional, bool batch_first); - using State = - std::tuple, bool, int64_t, double, bool, bool, bool>; - - std::tuple run(const Tensor& input_vk, const Tensor& hx_vk) - const; - State unpack() const; - - private: - explicit GruOpContext(VulkanOpContext vulkan_context); - VulkanOpContext vulkan_context_; + /* + * Assigns a name to each index in the unpacked list. 
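+   *
+   * For example (illustrative): GruPackedContext::pack() reads the number of
+   * layers back out of an unpacked list with
+   * unpacked.get(Unpacked::NumLayers).toInt().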
+ */ + struct Unpacked final { + static constexpr uint32_t Params = 0u; + static constexpr uint32_t hasBiases = 1u; + static constexpr uint32_t NumLayers = 2u; + static constexpr uint32_t Dropout = 3u; + static constexpr uint32_t Train = 4u; + static constexpr uint32_t Bidirectional = 5u; + static constexpr uint32_t BatchFirst = 6u; + + static constexpr uint32_t NumArgs = 7u; + }; + + /* + * Assigns a name to each index in the packed list. + */ + struct Packed final { + static constexpr uint32_t LinearContexts = 0u; + static constexpr uint32_t hasBiases = 1u; + static constexpr uint32_t NumLayers = 2u; + static constexpr uint32_t Dropout = 3u; + static constexpr uint32_t Train = 4u; + static constexpr uint32_t Bidirectional = 5u; + static constexpr uint32_t BatchFirst = 6u; + + static constexpr uint32_t NumArgs = 7u; + }; + + static GruPackedContext pack(c10::impl::GenericList); + + const c10::impl::GenericList unpack() const override; }; -c10::intrusive_ptr gru_prepack( +c10::intrusive_ptr create_gru_context( std::vector&& params_cpu, // weights/biases (cpu) bool has_biases, int64_t num_layers, @@ -115,10 +67,10 @@ c10::intrusive_ptr gru_prepack( bool bidirectional, bool batch_first); -std::tuple gru_run( +std::tuple run_gru_context( const Tensor& input_vk, const Tensor& hx_vk, - const c10::intrusive_ptr& context); + const c10::intrusive_ptr& vulkan_context); } // namespace ops } // namespace vulkan diff --git a/aten/src/ATen/native/vulkan/ops/Lerp.cpp b/aten/src/ATen/native/vulkan/ops/Lerp.cpp index 67240f64b2ccd..28921ef897640 100644 --- a/aten/src/ATen/native/vulkan/ops/Lerp.cpp +++ b/aten/src/ATen/native/vulkan/ops/Lerp.cpp @@ -11,18 +11,18 @@ using namespace api::utils; void check_inputs_elementwise_op(const Tensor& input1, const Tensor& input2) { TORCH_CHECK( - channels_size(input1) == channels_size(input2), + get_dim(input1) == get_dim(input2), "Vulkan elementwise ops require channel dimension to be equal!"); - if (batch_size(input1) != batch_size(input2)) { + if (get_dim(input1) != get_dim(input2)) { TORCH_CHECK( - channels_size(input1) % 4 == 0, + get_dim(input1) % 4 == 0, "Vulkan elementwise ops require channel to be a multiple of 4 to broadcast along batch dimension!") } - const uint32_t input1_h = height_size(input1); - const uint32_t input1_w = width_size(input1); - const uint32_t input2_h = height_size(input2); - const uint32_t input2_w = width_size(input2); + const uint32_t input1_h = get_dim(input1); + const uint32_t input1_w = get_dim(input1); + const uint32_t input2_h = get_dim(input2); + const uint32_t input2_w = get_dim(input2); const std::string broadcast_error_msg = "Incompatible input dimensions for broadcasting for Vulkan elementwise op!"; diff --git a/aten/src/ATen/native/vulkan/ops/Lstm.cpp b/aten/src/ATen/native/vulkan/ops/Lstm.cpp index c86583621c3bb..831b97175d45c 100644 --- a/aten/src/ATen/native/vulkan/ops/Lstm.cpp +++ b/aten/src/ATen/native/vulkan/ops/Lstm.cpp @@ -1,6 +1,5 @@ -#include +#include #include -#include #include namespace at { @@ -9,19 +8,25 @@ namespace vulkan { namespace ops { namespace { // -// input_vk: input tensor of shape (L, N, H_in) when batch_first=False or (N, L, -// H_in) when batch_first=True -// containing the features of the input sequence +// input_vk: input tensor of shape (L, N, H_in) when batch_first=False or +// (N, L, H_in) when batch_first=True containing the features of the input +// sequence +// // hx_vk: tensor of shape (D * num_layers, N, H_out) containing the initial -// hidden state for each element in the input 
sequence. cx_vk: tensor of shape -// (D * num_layers, N, H_cell) containing the initial cell state for each -// element in the input sequence. output: tensor of shape (L, N, D * H_out) when -// batch_first=False or (N, L, D * H_out) when batch_first=True -// containing the output features (h_t) from the last layer of the LSTM, -// for each t +// hidden state for each element in the input sequence. +// +// cx_vk: tensor of shape (D * num_layers, N, H_cell) containing the initial +// cell state for each element in the input sequence. +// +// output: tensor of shape (L, N, D * H_out) when batch_first=False or +// (N, L, D * H_out) when batch_first=True, containing the output features +// (h_t) from the last layer of the LSTM, for each t +// // h_n: tensor of shape (D * num_layers, N, H_out) containing the final hidden -// state for each element in the sequence. c_n: tensor of shape (D * num_layers, -// N, H_cell) containing the final cell state for each element in the sequence. +// state for each element in the sequence. +// +// c_n: tensor of shape (D * num_layers, N, H_cell) containing the final cell +// state for each element in the sequence. // // where // L = sequence length @@ -61,12 +66,17 @@ std::tuple lstm_input( TORCH_INTERNAL_ASSERT(!train, "Vulkan LSTM expects 'train' to be false."); TORCH_INTERNAL_ASSERT( !bidirectional, "Vulkan LSTM expects 'bidirectional' to be false."); - TORCH_INTERNAL_ASSERT( - batch_first, "Vulkan LSTM expects 'batch_first' to be true."); TORCH_INTERNAL_ASSERT( dropout < std::numeric_limits::epsilon() * 1000, "Vulkan LSTM expects 'dropout' to be 0.0."); + const auto batch_size = input_vk.size(0); + const auto seq_length = input_vk.size(1); + + TORCH_INTERNAL_ASSERT( + (batch_size == 1 && seq_length == 1) || batch_first, + "Vulkan gru expects batch-first input"); + const Tensor& hx_vk = hx[0]; const Tensor& cx_vk = hx[1]; @@ -75,8 +85,7 @@ std::tuple lstm_input( std::vector c_n_list; // cell state output // reshape to 2D due to Vulkan at::mm op accepts only 2D - auto x = - input_vk.reshape({input_vk.size(0) * input_vk.size(1), input_vk.size(2)}); + auto x = input_vk.reshape({batch_size * seq_length, input_vk.size(2)}); h_n_list.reserve(num_layers); c_n_list.reserve(num_layers); @@ -135,6 +144,7 @@ std::tuple lstm_input( auto h_n = at::cat(h_n_list, 1); auto c_n = at::cat(c_n_list, 1); + x = x.reshape({batch_size, seq_length, x.size(1)}); h_n = h_n.reshape({h_n.size(0) * h_n.size(1), h_n.size(2), h_n.size(3)}); c_n = c_n.reshape({c_n.size(0) * c_n.size(1), c_n.size(2), c_n.size(3)}); return std::tuple(x, h_n, c_n); @@ -150,13 +160,18 @@ TORCH_LIBRARY_IMPL(aten, Vulkan, m) { } // namespace -std::vector> pack_lstm_linear_op_contexts( +std::vector> +pack_lstm_linear_op_contexts( const std::vector& params_cpu, int64_t num_layers) { TORCH_CHECK( static_cast(params_cpu.size()) == 4 * num_layers, - "Vulkan LSTM expects 'params_cpu' size to be 4 * 'num_layers'."); - std::vector> linear_op_contexts; + "Vulkan LSTM expects 'params_cpu' size to be 4 * 'num_layers'." 
+ " But 'params_cpu' has size: ", + params_cpu.size(), + " and 'num_layers' is: ", + num_layers); + std::vector> linear_op_contexts; linear_op_contexts.reserve(num_layers * 8); for (int64_t l = 0; l < num_layers; ++l) { @@ -200,7 +215,7 @@ std::vector> pack_lstm_linear_op_contexts( return linear_op_contexts; } -VulkanOpContext lstm_context_create( +LstmPackedContext::LstmPackedContext( const std::vector& params_cpu, // weights/biases (cpu) bool has_biases, int64_t num_layers, @@ -213,42 +228,91 @@ VulkanOpContext lstm_context_create( TORCH_INTERNAL_ASSERT(!train, "Vulkan LSTM expects 'train' to be false."); TORCH_INTERNAL_ASSERT( !bidirectional, "Vulkan LSTM expects 'bidirectional' to be false."); - TORCH_INTERNAL_ASSERT( - batch_first, "Vulkan LSTM expects 'batch_first' to be true."); TORCH_INTERNAL_ASSERT( dropout < std::numeric_limits::epsilon() * 1000, "Vulkan LSTM expects 'dropout' to be 0.0."); - c10::impl::GenericList packed_context{c10::AnyType::get()}; - packed_context.reserve(7); - packed_context.emplace_back( - pack_lstm_linear_op_contexts(params_cpu, num_layers)); - packed_context.emplace_back(has_biases); - packed_context.emplace_back(num_layers); - packed_context.emplace_back(dropout); - packed_context.emplace_back(train); - packed_context.emplace_back(bidirectional); - packed_context.emplace_back(batch_first); - - c10::impl::GenericList unpacked_context{c10::AnyType::get()}; - unpacked_context.reserve(7); - unpacked_context.emplace_back(params_cpu); - unpacked_context.emplace_back(has_biases); - unpacked_context.emplace_back(num_layers); - unpacked_context.emplace_back(dropout); - unpacked_context.emplace_back(train); - unpacked_context.emplace_back(bidirectional); - unpacked_context.emplace_back(batch_first); - - return VulkanOpContext::create(packed_context, unpacked_context); + packed_.reserve(Packed::NumArgs); + packed_.emplace_back(pack_lstm_linear_op_contexts(params_cpu, num_layers)); + packed_.emplace_back(has_biases); + packed_.emplace_back(num_layers); + packed_.emplace_back(dropout); + packed_.emplace_back(train); + packed_.emplace_back(bidirectional); + packed_.emplace_back(batch_first); +} + +LstmPackedContext LstmPackedContext::pack(c10::impl::GenericList unpacked) { + return LstmPackedContext( + unpacked.get(Unpacked::Params).toTensorVector(), + unpacked.get(Unpacked::hasBiases).toBool(), + unpacked.get(Unpacked::NumLayers).toInt(), + unpacked.get(Unpacked::Dropout).toDouble(), + unpacked.get(Unpacked::Train).toBool(), + unpacked.get(Unpacked::Bidirectional).toBool(), + unpacked.get(Unpacked::BatchFirst).toBool()); +} + +const c10::impl::GenericList LstmPackedContext::unpack() const { + c10::impl::GenericList unpacked_lstm_context{c10::AnyType::get()}; + unpacked_lstm_context.reserve(Unpacked::NumArgs); + + const c10::List packed_linear_contexts = + get_val(Packed::LinearContexts).toList(); + + const int64_t num_layers = get_val(Packed::NumLayers).toInt(); + const int64_t linear_contexts_per_layer = 8; + + std::vector params_cpu; + params_cpu.reserve(num_layers * linear_contexts_per_layer); + + for (c10::IValue packed_linear_context : packed_linear_contexts) { + const c10::impl::GenericList unpacked_linear_context = + packed_linear_context.toCustomClass()->unpack(); + + TORCH_CHECK( + unpacked_linear_context.size() > 0u, + "unpacked_linear_context does not have any elements!"); + + params_cpu.emplace_back( + unpacked_linear_context.get(LinearPackedContext::Unpacked::Weight) + .toTensor() + .t()); + params_cpu.emplace_back( + 
unpacked_linear_context.get(LinearPackedContext::Unpacked::Bias) + .toTensor()); + } + unpacked_lstm_context.emplace_back(params_cpu); + for (int64_t i = 1; i < 7; ++i) { + unpacked_lstm_context.emplace_back(get_val(i)); + } + + return unpacked_lstm_context; } -std::tuple lstm_context_run( +c10::intrusive_ptr create_lstm_context( + std::vector&& params_cpu, + bool has_biases, + int64_t num_layers, + double dropout, + bool train, + bool bidirectional, + bool batch_first) { + return c10::make_intrusive(LstmPackedContext( + params_cpu, + has_biases, + num_layers, + dropout, + train, + bidirectional, + batch_first)); +} + +std::tuple run_lstm_context( const Tensor& input_vk, // input sequence (vulkan) const Tensor& hx_vk, // initial hidden state (vulkan) const Tensor& cx_vk, // initial cell state (vulkan) - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context) { + const c10::intrusive_ptr& lstm_context) { TORCH_INTERNAL_ASSERT( input_vk.sizes().size() == 3, "Vulkan LSTM expects input dims to be 3."); TORCH_INTERNAL_ASSERT( @@ -258,24 +322,34 @@ std::tuple lstm_context_run( cx_vk.sizes().size() == 3, "Vulkan LSTM expects cell state dims to be 3."); + const int64_t num_layers = + lstm_context->get_val(LstmPackedContext::Packed::NumLayers).toInt(); + const bool batch_first = + lstm_context->get_val(LstmPackedContext::Packed::BatchFirst).toBool(); + const auto batch_size = input_vk.size(0); + const auto seq_length = input_vk.size(1); + + TORCH_INTERNAL_ASSERT( + (batch_size == 1 && seq_length == 1) || batch_first, + "Vulkan gru expects batch-first input"); + const c10::List packed_linear_op_contexts = - packed_context.get(0).toList(); - const int64_t packed_num_layers = packed_context.get(2).toInt(); + lstm_context->get_val(LstmPackedContext::Packed::LinearContexts).toList(); + + const int64_t linear_op_contexts_per_layer = 8; + // (b_ii, w_ii), (b_hi, w_hi), (b_if, w_if), (b_hf, w_hf), + // (b_ig, w_ig), (b_hg, w_hg), (b_io, w_io), (b_ho, w_ho) - const int64_t linear_op_contexts_per_layer = - 8; // (b_ii, w_ii), (b_hi, w_hi), (b_if, w_if), (b_hf, w_hf), (b_ig, - // w_ig), (b_hg, w_hg), (b_io, w_io), (b_ho, w_ho) std::vector h_n_list; // hidden state output std::vector c_n_list; // cell state output // reshape to 2D due to Vulkan at::mm op accepts only 2D - auto x = - input_vk.reshape({input_vk.size(0) * input_vk.size(1), input_vk.size(2)}); + auto x = input_vk.reshape({batch_size * seq_length, input_vk.size(2)}); - h_n_list.reserve(packed_num_layers); - c_n_list.reserve(packed_num_layers); + h_n_list.reserve(num_layers); + c_n_list.reserve(num_layers); - for (int64_t l = 0; l < packed_num_layers; ++l) { + for (int64_t l = 0; l < num_layers; ++l) { // extract each hidden state and squeeze into 2D dim auto h = at::slice(hx_vk, 0, l, l + 1, 1); h = h.reshape({h.size(0) * h.size(1), h.size(2)}); @@ -285,49 +359,41 @@ std::tuple lstm_context_run( const auto& cxt_ii = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 0] - .toCustomClass(); + .toCustomClass(); const auto& cxt_hi = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 1] - .toCustomClass(); + .toCustomClass(); const auto& cxt_if = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 2] - .toCustomClass(); + .toCustomClass(); const auto& cxt_hf = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 3] - .toCustomClass(); + .toCustomClass(); const auto& cxt_ig = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 4] - .toCustomClass(); + .toCustomClass(); 
const auto& cxt_hg = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 5] - .toCustomClass(); + .toCustomClass(); const auto& cxt_io = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 6] - .toCustomClass(); + .toCustomClass(); const auto& cxt_ho = packed_linear_op_contexts[l * linear_op_contexts_per_layer + 7] - .toCustomClass(); + .toCustomClass(); const auto& i = at::sigmoid( - linear_context_run( - x, cxt_ii->get_packed(), cxt_ii->get_unpacked(), 1.0f, 1.0f) + - linear_context_run( - h, cxt_hi->get_packed(), cxt_hi->get_unpacked(), 1.0f, 1.0f)); + run_linear_context(x, cxt_ii) + run_linear_context(h, cxt_hi)); + // cxt_ii->run(x, 1.0f, 1.0f) + cxt_hi->run(h, 1.0f, 1.0f)); const auto& f = at::sigmoid( - linear_context_run( - x, cxt_if->get_packed(), cxt_if->get_unpacked(), 1.0f, 1.0f) + - linear_context_run( - h, cxt_hf->get_packed(), cxt_hf->get_unpacked(), 1.0f, 1.0f)); - const auto& g = at::tanh( - linear_context_run( - x, cxt_ig->get_packed(), cxt_ig->get_unpacked(), 1.0f, 1.0f) + - linear_context_run( - h, cxt_hg->get_packed(), cxt_hg->get_unpacked(), 1.0f, 1.0f)); + run_linear_context(x, cxt_if) + run_linear_context(h, cxt_hf)); + // cxt_if->run(x, 1.0f, 1.0f) + cxt_hf->run(h, 1.0f, 1.0f)); + const auto& g = + at::tanh(run_linear_context(x, cxt_ig) + run_linear_context(h, cxt_hg)); + // cxt_ig->run(x, 1.0f, 1.0f) + cxt_hg->run(h, 1.0f, 1.0f)); const auto& o = at::sigmoid( - linear_context_run( - x, cxt_io->get_packed(), cxt_io->get_unpacked(), 1.0f, 1.0f) + - linear_context_run( - h, cxt_ho->get_packed(), cxt_ho->get_unpacked(), 1.0f, 1.0f)); + run_linear_context(x, cxt_io) + run_linear_context(h, cxt_ho)); + // cxt_io->run(x, 1.0f, 1.0f) + cxt_ho->run(h, 1.0f, 1.0f)); c = f * c + i * g; h = o * at::tanh(c); x = h; // next input @@ -339,42 +405,12 @@ std::tuple lstm_context_run( auto h_n = at::cat(h_n_list, 1); auto c_n = at::cat(c_n_list, 1); + x = x.reshape({batch_size, seq_length, x.size(1)}); h_n = h_n.reshape({h_n.size(0) * h_n.size(1), h_n.size(2), h_n.size(3)}); c_n = c_n.reshape({c_n.size(0) * c_n.size(1), c_n.size(2), c_n.size(3)}); return std::tuple(x, h_n, c_n); } -c10::intrusive_ptr create_lstm_context( - std::vector&& params_cpu, - bool has_biases, - int64_t num_layers, - double dropout, - bool train, - bool bidirectional, - bool batch_first) { - return c10::make_intrusive(lstm_context_create( - params_cpu, - has_biases, - num_layers, - dropout, - train, - bidirectional, - batch_first)); -} - -std::tuple run_lstm_context( - const Tensor& input_vk, // input sequence (vulkan) - const Tensor& hx_vk, // initial hidden state (vulkan) - const Tensor& cx_vk, // initial cell state (vulkan) - const c10::intrusive_ptr& vulkan_context) { - return lstm_context_run( - input_vk, - hx_vk, - cx_vk, - vulkan_context->get_packed(), - vulkan_context->get_unpacked()); -} - } // namespace ops } // namespace vulkan } // namespace native diff --git a/aten/src/ATen/native/vulkan/ops/Lstm.h b/aten/src/ATen/native/vulkan/ops/Lstm.h index e793ad1d00a75..5f4006c67d2f3 100644 --- a/aten/src/ATen/native/vulkan/ops/Lstm.h +++ b/aten/src/ATen/native/vulkan/ops/Lstm.h @@ -3,7 +3,7 @@ #ifdef USE_VULKAN_API #include -#include +#include #include namespace at { @@ -11,60 +11,54 @@ namespace native { namespace vulkan { namespace ops { -// packed -// std::vector> linear_op_contexts; // -// {{ op context for b_ii, w_ii, op context for b_hi, w_hi, -// // -// op -// context -// for -// b_if, -// w_if, -// op -// context -// for -// b_hf, -// w_hf, -// // -// op -// context -// for -// 
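Similarly, the i/f/g/o gates computed above in run_lstm_context follow the standard LSTM cell equations, with run_linear_context(x, cxt_ii) evaluating W_{ii} x + b_{ii} and so on (a sketch of the math only):

```
i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi})
f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf})
g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg})
o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho})
c_t = f_t \odot c_{t-1} + i_t \odot g_t
h_t = o_t \odot \tanh(c_t)
```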
b_ig, -// w_ig, -// op -// context -// for -// b_hg, -// w_hg, -// // -// op -// context -// for -// b_io, -// w_io, -// op -// context -// for -// b_ho, -// w_ho,}, -// ...} -// bool has_biases{}; -// int64_t num_layers{}; -// double dropout{}; -// bool train{}; -// bool bidirectional{}; -// bool batch_first{}; +class LstmPackedContext final : virtual public VulkanPackedContext, + public torch::jit::CustomClassHolder { + public: + LstmPackedContext( + const std::vector& params_cpu, // weights/biases (cpu) + bool has_biases, + int64_t num_layers, + double dropout, + bool train, + bool bidirectional, + bool batch_first); -// unpacked -// std::vector params_cpu // weights/biases (cpu) -// bool has_biases -// int64_t num_layers -// double dropout -// bool train -// bool bidirectional -// bool batch_first + /* + * Assigns a name to each index in the unpacked list. + */ + struct Unpacked final { + static constexpr uint32_t Params = 0u; + static constexpr uint32_t hasBiases = 1u; + static constexpr uint32_t NumLayers = 2u; + static constexpr uint32_t Dropout = 3u; + static constexpr uint32_t Train = 4u; + static constexpr uint32_t Bidirectional = 5u; + static constexpr uint32_t BatchFirst = 6u; -c10::intrusive_ptr create_lstm_context( + static constexpr uint32_t NumArgs = 7u; + }; + + /* + * Assigns a name to each index in the packed list. + */ + struct Packed final { + static constexpr uint32_t LinearContexts = 0u; + static constexpr uint32_t hasBiases = 1u; + static constexpr uint32_t NumLayers = 2u; + static constexpr uint32_t Dropout = 3u; + static constexpr uint32_t Train = 4u; + static constexpr uint32_t Bidirectional = 5u; + static constexpr uint32_t BatchFirst = 6u; + + static constexpr uint32_t NumArgs = 7u; + }; + + static LstmPackedContext pack(c10::impl::GenericList); + + const c10::impl::GenericList unpack() const override; +}; + +c10::intrusive_ptr create_lstm_context( std::vector&& params_cpu, // weights/biases (cpu) bool has_biases, int64_t num_layers, @@ -77,7 +71,7 @@ std::tuple run_lstm_context( const Tensor& input_vk, // input sequence (vulkan) const Tensor& hx_vk, // initial hidden state (vulkan) const Tensor& cx_vk, // initial cell state (vulkan) - const c10::intrusive_ptr& vulkan_context); + const c10::intrusive_ptr& vulkan_context); } // namespace ops } // namespace vulkan diff --git a/aten/src/ATen/native/vulkan/ops/Mm.cpp b/aten/src/ATen/native/vulkan/ops/Mm.cpp index 0587a0a95a0ae..80b6ccb34ade6 100644 --- a/aten/src/ATen/native/vulkan/ops/Mm.cpp +++ b/aten/src/ATen/native/vulkan/ops/Mm.cpp @@ -1,6 +1,5 @@ #include #include -#include #include namespace at { @@ -42,7 +41,7 @@ vTensor pack_weights(const Tensor& weight_arg) { weight.options(), }; - api::StagingBuffer staging(context, v_weight.buffer_bytes()); + api::StorageBuffer staging(context, at::kFloat, v_weight.numcells()); { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); @@ -105,7 +104,7 @@ vTensor pack_biases( bias_arg->options(), }; - api::StagingBuffer staging(context, v_bias.buffer_bytes()); + api::StorageBuffer staging(context, at::kFloat, v_bias.numcells()); { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); @@ -134,7 +133,7 @@ vTensor pack_biases( weight_arg.options(), }; - api::StagingBuffer staging(context, v_bias.buffer_bytes()); + api::StorageBuffer staging(context, at::kFloat, v_bias.numcells()); { api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); @@ -178,40 +177,15 @@ bool available(const Tensor& weight, const c10::optional& bias) { 
true; } -bool usable( - const Tensor& input, - const Tensor& weight, - const c10::optional& /* bias */) { +bool usable(const Tensor& input, const IntArrayRef unpacked_weight_sizes) { return (2 == input.ndimension()) && (c10::DeviceType::Vulkan == input.device().type()) && (kFloat == input.scalar_type()) && (input.size(Layout::Parameter::width) == - weight.size(Layout::Parameter::height)) && + unpacked_weight_sizes[Layout::Parameter::height]) && !input.requires_grad() && true; } -VulkanOpContext context_create( - const Tensor& weight, - const c10::optional& bias) { - TORCH_CHECK( - available(weight, bias), - "Vulkan Linear not available! " - "Reason: The provided (weight, bias) parameters are either invalid " - "individually or their combination is not supported by Vulkan Impl."); - - c10::impl::GenericList packed_context{c10::AnyType::get()}; - packed_context.reserve(2); - packed_context.emplace_back(convert(pack_weights(weight))); - packed_context.emplace_back(convert(pack_biases(weight, bias))); - - c10::impl::GenericList unpacked_context{c10::AnyType::get()}; - unpacked_context.reserve(2); - unpacked_context.emplace_back(weight); - unpacked_context.emplace_back(bias); - - return VulkanOpContext::create(packed_context, unpacked_context); -} - static Tensor reshape_to_2d(const Tensor& input_arg) { TORCH_CHECK( input_arg.dim() >= 2, @@ -222,12 +196,11 @@ static Tensor reshape_to_2d(const Tensor& input_arg) { return input_arg.reshape({d, input_arg.size(-1)}); } -Tensor context_run( +Tensor run_addmm_context( const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context, const float alpha, - const float beta) { + const float beta, + const c10::intrusive_ptr& linear_context) { api::Context* const context = api::context(); const Tensor input_arg_2d = @@ -236,15 +209,19 @@ Tensor context_run( input_arg_2d.is_vulkan() ? input_arg_2d : input_arg_2d.vulkan(); const vTensor& v_input = convert(input); - const vTensor& packed_v_weight = convert(packed_context.get(0).toTensor()); - const vTensor& packed_v_bias = convert(packed_context.get(1).toTensor()); - const Tensor& unpacked_weight = unpacked_context.get(0).toTensor(); - const c10::optional& unpacked_bias = - unpacked_context.get(1).isTensor() ? unpacked_context.get(1).toTensor() - : c10::optional(); + const vTensor& packed_v_weight = convert( + linear_context->get_val(LinearPackedContext::Packed::Weight).toTensor()); + const vTensor& packed_v_bias = convert( + linear_context->get_val(LinearPackedContext::Packed::Bias).toTensor()); + const std::vector unpacked_weight_sizes = + linear_context->get_val(LinearPackedContext::Packed::WeightSizes) + .toIntVector(); + const bool bias_defined = + linear_context->get_val(LinearPackedContext::Packed::BiasDefined) + .toBool(); TORCH_CHECK( - usable(input, unpacked_weight, unpacked_bias), + usable(input, unpacked_weight_sizes), "Vulkan Linear not usable! 
" "Reason: The provided input tensor is either invalid on its own, or its " "combination with the provided weight and bias tensors are unsupported by " @@ -254,12 +231,12 @@ Tensor context_run( context, { v_input.sizes()[Layout::Parameter::height], - unpacked_weight.sizes()[Layout::Parameter::width], + unpacked_weight_sizes[Layout::Parameter::width], }, input.options(), }; - if (unpacked_bias && unpacked_bias->defined()) { + if (bias_defined) { const struct { uvec3 size; int32_t K; @@ -285,7 +262,7 @@ Tensor context_run( // global work group size { safe_downcast(div_up( - unpacked_weight.sizes()[Layout::Parameter::width], INT64_C(2))), + unpacked_weight_sizes[Layout::Parameter::width], INT64_C(2))), safe_downcast( div_up(v_input.sizes()[Layout::Parameter::height], INT64_C(2))), 1, @@ -325,7 +302,7 @@ Tensor context_run( // global work group size { safe_downcast(div_up( - unpacked_weight.sizes()[Layout::Parameter::width], INT64_C(2))), + unpacked_weight_sizes[Layout::Parameter::width], INT64_C(2))), safe_downcast( div_up(v_input.sizes()[Layout::Parameter::height], INT64_C(2))), 1, @@ -364,26 +341,21 @@ Tensor addmm( const Tensor& weight, const Scalar& beta, const Scalar& alpha) { - VulkanOpContext vulkan_context = context_create(weight, bias); - - return context_run( + return run_addmm_context( input, - vulkan_context.get_packed(), - vulkan_context.get_unpacked(), alpha.to(), - beta.to()); + beta.to(), + c10::make_intrusive( + LinearPackedContext(weight, bias))); } Tensor mm(const Tensor& mat1_arg, const Tensor& mat2_arg) { - VulkanOpContext vulkan_context = - context_create(mat2_arg, c10::optional()); - - return context_run( + return run_addmm_context( mat1_arg, - vulkan_context.get_packed(), - vulkan_context.get_unpacked(), 1.0f, - 1.0f); + 1.0f, + c10::make_intrusive( + LinearPackedContext(mat2_arg, c10::optional()))); } #ifdef USE_VULKAN_API @@ -397,82 +369,46 @@ TORCH_LIBRARY_IMPL(aten, Vulkan, m) { } // namespace -VulkanOpContext linear_context_create( +LinearPackedContext::LinearPackedContext( const Tensor& weight, - const c10::optional& bias) { - return context_create(weight, bias); -} - -Tensor linear_context_run( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context, - const float alpha, - const float beta) { - return context_run(input_arg, packed_context, unpacked_context, alpha, beta); -} - -c10::intrusive_ptr create_linear_context( - Tensor&& weight, - c10::optional&& bias) { - return c10::make_intrusive( - linear_context_create(weight, bias)); -} - -Tensor run_linear_context( - const Tensor& input, - const c10::intrusive_ptr& vulkan_context) { - return linear_context_run( - input, - vulkan_context->get_packed(), - vulkan_context->get_unpacked(), - 1.0f, - 1.0f); -} - -/* Backwards compatibility */ -LinearOpContext::LinearOpContext(VulkanOpContext vulkan_context) - : vulkan_context_{std::move(vulkan_context)} {} + const c10::optional& bias) + : unpacked_{c10::AnyType::get()} { + TORCH_CHECK( + available(weight, bias), + "Vulkan Linear not available! 
" + "Reason: The provided (weight, bias) parameters are either invalid " + "individually or their combination is not supported by Vulkan Impl."); -LinearOpContext LinearOpContext::create( - const Tensor& weight, - const c10::optional& bias) { - return LinearOpContext{linear_context_create(weight, bias)}; -} + packed_.reserve(Packed::NumArgs); + packed_.emplace_back(convert(pack_weights(weight))); + packed_.emplace_back(convert(pack_biases(weight, bias))); + packed_.emplace_back(weight.sizes()); + packed_.emplace_back(bias && bias->defined()); -Tensor LinearOpContext::run( - const Tensor& input_arg, - const float alpha, - const float beta) const { - return linear_context_run( - input_arg, - vulkan_context_.get_packed(), - vulkan_context_.get_unpacked(), - alpha, - beta); + if (!at::globalContext().releaseWeightsWhenPrepacking()) { + unpacked_.reserve(Unpacked::NumArgs); + unpacked_.emplace_back(weight); + unpacked_.emplace_back(bias); + } } -LinearOpContext::State LinearOpContext::unpack() const { - const c10::impl::GenericList unpacked_ = - std::get<1>(vulkan_context_.get_state()); - const Tensor unpacked_weight = unpacked_.get(0).toTensor(); - const c10::optional unpacked_bias = unpacked_.get(1).isTensor() - ? unpacked_.get(1).toTensor() - : c10::optional(); - return LinearOpContext::State{unpacked_weight, unpacked_bias}; +LinearPackedContext LinearPackedContext::pack(c10::impl::GenericList unpacked) { + return LinearPackedContext( + unpacked.get(Unpacked::Weight).toTensor(), + get_optional_tensor(unpacked, Unpacked::Bias)); } -c10::intrusive_ptr linear_prepack( +c10::intrusive_ptr create_linear_context( Tensor&& weight, c10::optional&& bias) { - return c10::make_intrusive( - LinearOpContext::create(std::move(weight), std::move(bias))); + return c10::make_intrusive( + LinearPackedContext(weight, bias)); } -Tensor linear_run( +Tensor run_linear_context( const Tensor& input, - const c10::intrusive_ptr& context) { - return context->run(input, 1.0, 1.0); + const c10::intrusive_ptr& linear_context) { + return run_addmm_context(input, 1.0f, 1.0f, linear_context); } } // namespace ops diff --git a/aten/src/ATen/native/vulkan/ops/Mm.h b/aten/src/ATen/native/vulkan/ops/Mm.h index 4d573b575bd40..17909eab6d4e6 100644 --- a/aten/src/ATen/native/vulkan/ops/Mm.h +++ b/aten/src/ATen/native/vulkan/ops/Mm.h @@ -3,7 +3,7 @@ #ifdef USE_VULKAN_API #include -#include +#include #include namespace at { @@ -11,57 +11,52 @@ namespace native { namespace vulkan { namespace ops { -// packed -// vTensor v_weight -// vTensor v_bias - -// unpacked -// Tensor weight -// c10::optional bias +class LinearPackedContext final : virtual public VulkanPackedContext, + public torch::jit::CustomClassHolder { + private: + c10::impl::GenericList unpacked_; -VulkanOpContext linear_context_create( - const Tensor& weight, - const c10::optional& bias); + public: + LinearPackedContext(const Tensor& weight, const c10::optional& bias); -Tensor linear_context_run( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context, - const float alpha, - const float beta); + /* + * Assigns a name to each index in the unpacked list. 
+ */ + struct Unpacked final { + static constexpr uint32_t Weight = 0u; + static constexpr uint32_t Bias = 1u; -c10::intrusive_ptr create_linear_context( - Tensor&& weight, - c10::optional&& bias); + static constexpr uint32_t NumArgs = 2u; + }; -Tensor run_linear_context( - const Tensor& input, - const c10::intrusive_ptr& context); + /* + * Assigns a name to each index in the packed list. + */ + struct Packed final { + static constexpr uint32_t Weight = 0u; + static constexpr uint32_t Bias = 1u; + static constexpr uint32_t WeightSizes = 2u; + static constexpr uint32_t BiasDefined = 3u; -// Backwards compatibility -class LinearOpContext final : public torch::jit::CustomClassHolder { - public: - static LinearOpContext create( - const Tensor& weight, - const c10::optional& bias); + static constexpr uint32_t NumArgs = 4u; + }; - using State = std::tuple>; + static LinearPackedContext pack(c10::impl::GenericList); - Tensor run(const Tensor& input, float beta, float alpha) const; - State unpack() const; + const c10::impl::GenericList unpack() const override { + TORCH_CHECK(unpacked_.size() > 0u, "unpacked_ does not have any elements!"); - private: - explicit LinearOpContext(VulkanOpContext vulkan_context); - VulkanOpContext vulkan_context_; + return unpacked_; + } }; -c10::intrusive_ptr linear_prepack( +c10::intrusive_ptr create_linear_context( Tensor&& weight, c10::optional&& bias); -Tensor linear_run( +Tensor run_linear_context( const Tensor& input, - const c10::intrusive_ptr& context); + const c10::intrusive_ptr& context); } // namespace ops } // namespace vulkan diff --git a/aten/src/ATen/native/vulkan/ops/QuantizedConvolution.cpp b/aten/src/ATen/native/vulkan/ops/QuantizedConvolution.cpp deleted file mode 100644 index 283967fb9087a..0000000000000 --- a/aten/src/ATen/native/vulkan/ops/QuantizedConvolution.cpp +++ /dev/null @@ -1,648 +0,0 @@ -#include -#include -#include -#include -#include -#include -#include -#include - -namespace at { -namespace native { -namespace vulkan { -namespace ops { -namespace { - -using namespace api::utils; -using namespace at::native::vulkan::ops; - -inline bool is_depthwise(const IntArrayRef filter, const int64_t groups) { - return (filter[Layout::Filter::output] == groups) && - // Only K == 1 supported. 
- (filter[Layout::Filter::input] == 1); -} - -inline bool is_pointwise(const IntArrayRef filter) { - return (1 == filter[Layout::Filter::height]) && - (1 == filter[Layout::Filter::width]); -} - -bool all_lessthan(const IntArrayRef arr, const int t) { - bool retval = true; - for (const auto i : c10::irange(arr.size())) { - retval = retval && (arr[i] < t); - } - return retval; -} - -Conv2dQMethod determine_method( - const IntArrayRef filter, - const IntArrayRef stride, - const IntArrayRef padding, - const IntArrayRef dilation, - const int64_t groups) { - if (is_depthwise(filter, groups)) - return Conv2dQDepthwise; - if (is_pointwise(filter)) - return Conv2dQPointwise; - return Conv2dQSlidingWindow; -} - -vTensor pack_weights_dw_q(api::Context* const context, const Tensor& weight) { - /* Source */ - const IntArrayRef src_filter = weight.sizes(); - const c10::quint8* const src_weight_ptr = weight.data_ptr(); - - const int64_t src_kw_sz = src_filter[Layout::Filter::width]; - const int64_t src_kh_sz = src_filter[Layout::Filter::height]; - const int64_t src_kernel_sz = src_kw_sz * src_kh_sz; - const int64_t src_block_sz = - src_kernel_sz * src_filter[Layout::Filter::input]; - const int64_t num_stacks = - div_up(src_filter[Layout::Filter::output], INT64_C(4)); - - /* Destination */ - const int64_t dst_kw_sz = src_kernel_sz; - const int64_t dst_kh_sz = num_stacks; - const int64_t dst_kernel_sz = dst_kw_sz * dst_kh_sz; - - vTensor v_weight{ - context, - { - 4, - dst_kh_sz, - dst_kw_sz, - }, - weight.options(), - weight.q_scale(), - weight.q_zero_point(), - }; - api::StagingBuffer staging(context, v_weight.buffer_bytes()); - { - api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); - - c10::quint8* dst_weight_ptr = mapping.template data(); - - memset(dst_weight_ptr, 0, v_weight.nbytes()); - - for (const auto src_oc : c10::irange(src_filter[Layout::Filter::output])) { - /* Source */ - const c10::quint8* const src_weight_oc_ptr = - src_weight_ptr + src_oc * src_block_sz; - - /* Destination */ - const int64_t dst_oh = src_oc / 4; - const int64_t dst_c = src_oc % 4; - - c10::quint8* const dst_weight_c_ptr = - dst_weight_ptr + dst_c * dst_kernel_sz + dst_oh * dst_kw_sz; - - for (const auto src_ih : - c10::irange(src_filter[Layout::Filter::height])) { - memcpy( - dst_weight_c_ptr + src_ih * src_kw_sz, - src_weight_oc_ptr + src_ih * src_kw_sz, - sizeof(c10::quint8) * src_kw_sz); - } - } - } - ops::utils::pack_staging_to_vtensor(staging.buffer(), v_weight); - - return v_weight; -} - -vTensor pack_weights_2d_q(api::Context* const context, const Tensor& weight) { - /* Source */ - const IntArrayRef src_filter = weight.sizes(); - const c10::quint8* const src_weight_ptr = weight.data_ptr(); - - const int64_t src_kw_sz = src_filter[Layout::Filter::width]; - const int64_t src_kh_sz = src_filter[Layout::Filter::height]; - const int64_t src_kernel_sz = src_kw_sz * src_kh_sz; - const int64_t src_block_sz = - src_kernel_sz * src_filter[Layout::Filter::input]; - - const int64_t num_stacks = - div_up(src_filter[Layout::Filter::output], INT64_C(4)); - const int64_t stack_depth = - api::utils::align_up(src_filter[Layout::Filter::input], INT64_C(4)); - - /* Destination */ - const int64_t dst_kw_sz = src_kw_sz * stack_depth; - const int64_t dst_kh_sz = src_kh_sz * num_stacks; - const int64_t dst_kernel_sz = dst_kw_sz * dst_kh_sz; - - vTensor v_weight{ - context, - { - 4, - dst_kh_sz, - dst_kw_sz, - }, - weight.options(), - weight.q_scale(), - weight.q_zero_point(), - }; - - api::StagingBuffer 
staging(context, v_weight.buffer_bytes()); - { - api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); - - c10::quint8* dst_weight_ptr = mapping.template data(); - - memset(dst_weight_ptr, 0, v_weight.nbytes()); - - for (const auto src_oc : c10::irange(src_filter[Layout::Filter::output])) { - /* Source */ - const c10::quint8* const src_weight_oc_ptr = - src_weight_ptr + src_oc * src_block_sz; - - /* Destination */ - const int64_t dst_oh = src_oc / 4; - const int64_t dst_c = src_oc % 4; - - c10::quint8* const dst_weight_c_ptr = - dst_weight_ptr + dst_c * dst_kernel_sz; - - for (const auto src_ic : c10::irange(src_filter[Layout::Filter::input])) { - const int64_t dst_ic4 = src_ic / 4; - - for (const auto src_ih : c10::irange(src_kh_sz)) { - for (const auto src_iw : c10::irange(src_kw_sz)) { - memcpy( - dst_weight_c_ptr + (dst_oh * src_kh_sz + src_ih) * dst_kw_sz + - dst_ic4 * src_kw_sz * 4 + src_iw * 4 + src_ic % 4, - src_weight_oc_ptr + src_ic * src_kernel_sz + - src_ih * src_kw_sz + src_iw, - sizeof(c10::quint8)); - } - } - } - } - } - ops::utils::pack_staging_to_vtensor(staging.buffer(), v_weight); - - return v_weight; -} - -vTensor pack_weights_q( - const Tensor& weight_arg, - const Conv2dQMethod conv_method) { - if (weight_arg.is_vulkan()) { - return convert(weight_arg); - } - - api::Context* const context = api::context(); - - const Tensor weight = weight_arg.contiguous(); - - if (conv_method == Conv2dQDepthwise) { - return pack_weights_dw_q(context, weight); - } - - return pack_weights_2d_q(context, weight); -} - -vTensor pack_biases_q(const c10::optional& bias, const Tensor& weight) { - if (bias && bias->is_vulkan()) { - return convert(*bias); - } - - api::Context* const context = api::context(); - - const int64_t src_w = weight.size(Layout::Filter::output); - const int64_t packed_w = div_up(src_w, INT64_C(4)); - vTensor v_bias{ - context, - { - 4, - 1, - packed_w, - }, - weight.options(), - weight.q_scale(), - weight.q_zero_point(), - }; - - api::StagingBuffer staging(context, v_bias.buffer_bytes()); - { - api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); - - c10::quint8* dst_bias_ptr = mapping.template data(); - - if (bias) { - const c10::quint8* const src_bias_ptr = - bias->contiguous().data_ptr(); - - memset(dst_bias_ptr, 0, v_bias.nbytes()); - for (const auto i : c10::irange(src_w)) { - const int64_t c = i % 4; - const int64_t x = i / 4; - dst_bias_ptr[c * packed_w + x] = src_bias_ptr[i]; - } - } else { - memset( - dst_bias_ptr, - // 2's complement integers and IEEE-754 floating point numbers both - // have identical bit representations for 0, so can use memset which - // only accepts uint8_t parameter. 
- 0, - v_bias.nbytes()); - } - } - ops::utils::pack_staging_to_vtensor(staging.buffer(), v_bias); - - return v_bias; -} - -std::array pack_filter( - const Tensor& weight, - const IntArrayRef dilation) { - const IntArrayRef filter = weight.sizes(); - - const auto effective = [](const int64_t k, const int64_t d) { - return k + (k - 1) * (d - 1); - }; - - return { - align_up(filter[Layout::Filter::output], INT64_C(4)), - align_up(filter[Layout::Filter::input], INT64_C(4)), - effective( - filter[Layout::Filter::height], dilation[Layout::Parameter::height]), - effective( - filter[Layout::Filter::width], dilation[Layout::Parameter::width]), - }; -} - -std::array pack_params(const std::vector& vector) { - TORCH_INTERNAL_ASSERT(2u == vector.size(), "Invalid usage!"); - - return { - vector[0], - vector[1], - }; -} - -bool available( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride, - const IntArrayRef padding, - const IntArrayRef dilation, - const bool transposed, - const IntArrayRef /* output_padding */, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - return api::available() && - // Weight - (4 == weight.ndimension()) && (weight.size(Layout::Filter::height) > 0) && - (weight.size(Layout::Filter::width) > 0) && - ((weight.device().is_cpu()) || - (c10::DeviceType::Vulkan == weight.device().type())) && - (kFloat == weight.scalar_type() || - c10::kQUInt8 == weight.scalar_type()) && - // Bias - ((bias && bias->defined()) - ? ((1 == bias->ndimension()) && - ((bias->device().is_cpu()) || - (c10::DeviceType::Vulkan == bias->device().type())) && - (kFloat == bias->scalar_type() || - c10::kQUInt8 == bias->scalar_type()) && - (transposed ? false /* to be addded in the future */ - : (weight.size(Layout::Filter::output) == - bias->size(Layout::Filter::output)))) - : true) && - // Stride - (stride[Layout::Parameter::height] > 0) && - (stride[Layout::Parameter::width] > 0) && - // Padding - (padding[Layout::Parameter::height] >= 0) && - (padding[Layout::Parameter::width] >= 0) && - // Dilation - (dilation[Layout::Parameter::height] > 0) && - (dilation[Layout::Parameter::width] > 0) && - // Groups - (groups > 0) && - // Input - (weight.size(Layout::Filter::input) > 0) && - // Output - (weight.size(Layout::Filter::output) > 0) && - // Output - Groups - ((weight.size(Layout::Filter::output) % groups) == 0) && - // Output Min / Max - (!output_min || output_min->isFloatingPoint()) && - (!output_max || output_max->isFloatingPoint()) && true; -} - -bool usable(const Tensor& input) { - // Input - return (4 == input.ndimension()) && - (c10::DeviceType::Vulkan == input.device().type()) && - (kFloat == input.scalar_type() || c10::kQUInt8 == input.scalar_type()) && - (input.size(Layout::Activation4D::batch) >= 0) && - (input.size(Layout::Activation4D::channels) > 0) && - (input.size(Layout::Activation4D::height) > 0) && - (input.size(Layout::Activation4D::width) > 0) && !input.requires_grad() && - true; -} - -} // namespace - -VulkanOpContext conv2d_context_create_q( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride_arg, - const IntArrayRef padding_arg, - const IntArrayRef dilation_arg, - const bool transposed, - const IntArrayRef output_padding_arg, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - const auto stride = expand_param_if_needed(stride_arg, "stride", 2); - const auto padding = expand_param_if_needed(padding_arg, "padding", 2); - const auto dilation = 
expand_param_if_needed(dilation_arg, "dilation", 2); - const auto output_padding = output_padding_arg; // TODO: Deconvolutions - - TORCH_CHECK( - available( - weight, - bias, - stride, - padding, - dilation, - transposed, - output_padding, - groups, - output_min, - output_max), - "Vulkan::convolution not available! " - "Reason: The provided (weight, bias, stride, padding, dilation, groups, " - "transposed, output_padding, output_min, output_max) parameters are either " - "invalid individually or their combination is not supported by Vulkan impl."); - - TORCH_CHECK(weight.is_quantized(), "Weight Tensor is not Quantized"); - TORCH_CHECK(bias->is_quantized(), "Bias Tensor is not Quantized"); - - auto method = - determine_method(weight.sizes(), stride, padding, dilation, groups); - - c10::impl::GenericList packed_context{c10::AnyType::get()}; - packed_context.reserve(10); - packed_context.emplace_back(convert(pack_weights_q(weight, method))); - packed_context.emplace_back(convert(pack_biases_q(bias, weight))); - packed_context.emplace_back(pack_filter(weight, dilation)); - packed_context.emplace_back(pack_params(stride)); - packed_context.emplace_back(pack_params(padding)); - packed_context.emplace_back(output_padding); - packed_context.emplace_back(pack_params(dilation)); - packed_context.emplace_back(safe_downcast(groups)); - packed_context.emplace_back( - output_min ? output_min->template to() - : -std::numeric_limits::infinity()); - packed_context.emplace_back( - output_max ? output_max->template to() - : +std::numeric_limits::infinity()); - packed_context.emplace_back(method); - - c10::impl::GenericList unpacked_context{c10::AnyType::get()}; - unpacked_context.reserve(10); - unpacked_context.emplace_back(weight); - unpacked_context.emplace_back(bias); - unpacked_context.emplace_back(weight.sizes().vec()); - unpacked_context.emplace_back(stride_arg.vec()); - unpacked_context.emplace_back(padding_arg.vec()); - unpacked_context.emplace_back(output_padding_arg.vec()); - unpacked_context.emplace_back(dilation_arg.vec()); - unpacked_context.emplace_back(groups); - unpacked_context.emplace_back(output_min); - unpacked_context.emplace_back(output_max); - unpacked_context.emplace_back(method); - return VulkanOpContext::create(packed_context, unpacked_context); -} - -void conv2d_sliding_window_q( - const api::ShaderSource& shader, - vTensor& v_output, - const vTensor& v_input, - const vTensor& packed_v_weight, - const vTensor& packed_v_bias, - const IntArrayRef packed_filter, - const IntArrayRef packed_stride, - const IntArrayRef packed_padding, - const IntArrayRef packed_dilation, - const float packed_output_min, - const float packed_output_max, - const IntArrayRef unpacked_filter, - const Conv2dQMethod method_, - const double scale, - const int64_t zero_point) { - api::Context* const context = api::context(); - - const double scale_out = v_output.get_scale(); - const int64_t zero_point_out = v_output.get_zero_point(); - - const double weight_scale = packed_v_weight.get_scale(); - const int64_t weight_zero_point = packed_v_weight.get_zero_point(); - - const double bias_scale = packed_v_bias.get_scale(); - const int64_t bias_zero_point = packed_v_bias.get_zero_point(); - - const struct Block final { - uvec3 extents; - int32_t ic4; - ivec4 kernel; - float scale_out; - float scale; - int32_t zero_point_out; - int32_t zero_point; - float weight_scale; - float bias_scale; - int32_t weight_zero_point; - int32_t bias_zero_point; - ivec2 ikernel; - ivec2 stride; - ivec2 padding; - ivec2 dilate; - vec2 
clamp; - } block{ - v_output.extents(), - safe_downcast(packed_filter[Layout::Filter::input]), - { - safe_downcast(packed_filter[Layout::Filter::width]), - safe_downcast(packed_filter[Layout::Filter::height]), - safe_downcast(v_input.sizes()[Layout::Activation4D::width]), - safe_downcast(v_input.sizes()[Layout::Activation4D::height]), - }, - safe_downcast(scale_out), - safe_downcast(scale), - safe_downcast(zero_point_out), - safe_downcast(zero_point), - safe_downcast(weight_scale), - safe_downcast(bias_scale), - safe_downcast(weight_zero_point), - safe_downcast(bias_zero_point), - { - safe_downcast(unpacked_filter[Layout::Filter::width]), - safe_downcast(unpacked_filter[Layout::Filter::height]), - }, - { - safe_downcast(packed_stride[Layout::Parameter::width]), - safe_downcast(packed_stride[Layout::Parameter::height]), - }, - { - safe_downcast(packed_padding[Layout::Parameter::width]), - safe_downcast(packed_padding[Layout::Parameter::height]), - }, - { - safe_downcast(packed_dilation[Layout::Parameter::width]), - safe_downcast(packed_dilation[Layout::Parameter::height]), - }, - { - packed_output_min, - packed_output_max, - }, - }; - - uvec3 global_size = v_output.extents(); - if (method_ == Conv2dQPointwise) { - global_size = { - safe_downcast( - div_up(v_output.sizes()[Layout::Filter::width], INT64_C(2))), - safe_downcast( - div_up(v_output.sizes()[Layout::Filter::height], INT64_C(2))), - v_output.extents().data[2u]}; - } - - api::UniformParamsBuffer params(context, block); - api::PipelineBarrier pipeline_barrier{}; - - context->submit_compute_job( - // shader descriptor - shader, - // pipeline barrier - pipeline_barrier, - // global work group size - global_size, - // local work group size - adaptive_work_group_size(global_size), - // fence handle - VK_NULL_HANDLE, - // shader arguments - v_output.image( - pipeline_barrier, - api::PipelineStage::COMPUTE, - api::MemoryAccessType::WRITE), - v_input.image(pipeline_barrier, api::PipelineStage::COMPUTE), - packed_v_weight.image(pipeline_barrier, api::PipelineStage::COMPUTE), - packed_v_bias.image(pipeline_barrier, api::PipelineStage::COMPUTE), - // params buffer - params.buffer()); -} - -Tensor conv2d_context_run_q( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context, - double scale, - int64_t zero_point) { - api::Context* const context = api::context(); - - const Tensor input = input_arg.is_vulkan() ? input_arg : input_arg.vulkan(); - const vTensor& v_input = convert(input); - - const vTensor& packed_v_weight = convert(packed_context.get(0).toTensor()); - const vTensor& packed_v_bias = convert(packed_context.get(1).toTensor()); - - const auto packed_filter = packed_context.get(2).toIntVector(); - const auto packed_stride = packed_context.get(3).toIntVector(); - const auto packed_padding = packed_context.get(4).toIntVector(); - const auto packed_dilation = packed_context.get(6).toIntVector(); - const float packed_output_min = - safe_downcast(packed_context.get(8).toDouble()); - const float packed_output_max = - safe_downcast(packed_context.get(9).toDouble()); - const auto unpacked_filter = unpacked_context.get(2).toIntVector(); - const Conv2dQMethod method_ = (Conv2dQMethod)unpacked_context.get(10).toInt(); - - TORCH_CHECK( - usable(input), - "Vulkan Convolution not usable! 
" - "Reason: The provided input tensor is either invalid or unsupported by Vulkan impl."); - - vTensor v_output{ - context, - conv_output_size( - v_input.sizes(), - unpacked_filter, - packed_padding, - packed_stride, - packed_dilation), - input.options(), - scale, - zero_point, - }; - - if (method_ == Conv2dQSlidingWindow) { - conv2d_sliding_window_q( - VK_KERNEL(quantized_conv2d), - v_output, - v_input, - packed_v_weight, - packed_v_bias, - packed_filter, - packed_stride, - packed_padding, - packed_dilation, - packed_output_min, - packed_output_max, - unpacked_filter, - method_, - v_input.get_scale(), - v_input.get_zero_point()); - } else if (method_ == Conv2dQPointwise) { - conv2d_sliding_window_q( - VK_KERNEL(quantized_conv2d_pw_2x2), - v_output, - v_input, - packed_v_weight, - packed_v_bias, - packed_filter, - packed_stride, - packed_padding, - packed_dilation, - packed_output_min, - packed_output_max, - unpacked_filter, - method_, - v_input.get_scale(), - v_input.get_zero_point()); - } else if (method_ == Conv2dQDepthwise) { - conv2d_sliding_window_q( - VK_KERNEL(quantized_conv2d_dw), - v_output, - v_input, - packed_v_weight, - packed_v_bias, - packed_filter, - packed_stride, - packed_padding, - packed_dilation, - packed_output_min, - packed_output_max, - unpacked_filter, - method_, - v_input.get_scale(), - v_input.get_zero_point()); - } else { - TORCH_CHECK(false, "Invalid Method"); - } - - return convert_quantized(v_output); -} - -} // namespace ops -} // namespace vulkan -} // namespace native -} // namespace at diff --git a/aten/src/ATen/native/vulkan/ops/QuantizedConvolution.h b/aten/src/ATen/native/vulkan/ops/QuantizedConvolution.h deleted file mode 100644 index 4853623a7fa37..0000000000000 --- a/aten/src/ATen/native/vulkan/ops/QuantizedConvolution.h +++ /dev/null @@ -1,44 +0,0 @@ -#pragma once - -#ifdef USE_VULKAN_API - -#include -#include -#include - -namespace at { -namespace native { -namespace vulkan { -namespace ops { - -enum Conv2dQMethod { - Conv2dQDepthwise, - Conv2dQPointwise, - Conv2dQSlidingWindow, -}; - -VulkanOpContext conv2d_context_create_q( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride_arg, - const IntArrayRef padding_arg, - const IntArrayRef dilation_arg, - const bool transposed, - const IntArrayRef output_padding_arg, - const int64_t groups, - const c10::optional& output_min = c10::nullopt, - const c10::optional& output_max = c10::nullopt); - -Tensor conv2d_context_run_q( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context, - double scale, - int64_t zero_point); - -} // namespace ops -} // namespace vulkan -} // namespace native -} // namespace at - -#endif /* USE_VULKAN_API */ diff --git a/aten/src/ATen/native/vulkan/ops/Register.cpp b/aten/src/ATen/native/vulkan/ops/Register.cpp index 4cc1ba4e8bb6b..18d5a6facfaed 100644 --- a/aten/src/ATen/native/vulkan/ops/Register.cpp +++ b/aten/src/ATen/native/vulkan/ops/Register.cpp @@ -5,10 +5,7 @@ #include #include #include -#include #include -#include -#include #include #include @@ -19,133 +16,110 @@ namespace ops { namespace { TORCH_LIBRARY(vulkan, m) { - m.class_("VulkanOpContext") + m.class_("LinearPackedContext") .def_pickle( // __getstate__ - [](const c10::intrusive_ptr& context) { - return context->get_state(); + [](const c10::intrusive_ptr& context) { + // context is packed + return context->unpack(); }, // __setstate__ - [](VulkanOpContext::State state) { - return 
c10::make_intrusive(VulkanOpContext::create( - std::get<0>(state), std::get<1>(state))); + [](c10::impl::GenericList state) { + // state is unpacked + return c10::make_intrusive( + LinearPackedContext::pack(state)); }); - // To maintain backwards compatibility. - m.class_("Conv2dOpContext") + m.class_("GruPackedContext") .def_pickle( // __getstate__ - [](const c10::intrusive_ptr& context) { + [](const c10::intrusive_ptr& context) { + // context is packed return context->unpack(); }, // __setstate__ - [](Conv2dOpContext::State state) { - return conv2d_clamp_prepack( - std::move(std::get<0>(state)), - std::move(std::get<1>(state)), - std::move(std::get<2>(state)), - std::move(std::get<3>(state)), - std::move(std::get<4>(state)), - std::get<5>(state), - std::get<6>(state), - std::get<7>(state)); + [](c10::impl::GenericList state) { + // state is unpacked + return c10::make_intrusive( + GruPackedContext::pack(state)); }); - // To maintain backwards compatibility. - m.class_("TransposeConv2dOpContext") + m.class_("LstmPackedContext") .def_pickle( // __getstate__ - [](const c10::intrusive_ptr& context) { + [](const c10::intrusive_ptr& context) { + // context is packed return context->unpack(); }, // __setstate__ - [](TransposeConv2dOpContext::State state) { - return conv2d_transpose_clamp_prepack( - std::move(std::get<0>(state)), - std::move(std::get<1>(state)), - std::move(std::get<2>(state)), - std::move(std::get<3>(state)), - std::move(std::get<4>(state)), - std::move(std::get<5>(state)), - std::get<6>(state), - std::get<7>(state), - std::get<8>(state)); + [](c10::impl::GenericList state) { + // state is unpacked + return c10::make_intrusive( + LstmPackedContext::pack(state)); }); - // To maintain backwards compatibility. - m.class_("LinearOpContext") + m.class_("Conv2dPackedContext") .def_pickle( // __getstate__ - [](const c10::intrusive_ptr& context) { + [](const c10::intrusive_ptr& context) { + // context is packed return context->unpack(); }, // __setstate__ - [](LinearOpContext::State state) { - return linear_prepack( - std::move(std::get<0>(state)), std::move(std::get<1>(state))); + [](c10::impl::GenericList state) { + // state is unpacked + return c10::make_intrusive( + Conv2dPackedContext::pack(state)); }); // To maintain backwards compatibility. - m.class_("GruOpContext") + m.class_("Conv2dOpContext") .def_pickle( // __getstate__ - [](const c10::intrusive_ptr& context) { + [](const c10::intrusive_ptr& context) { return context->unpack(); }, // __setstate__ - [](GruOpContext::State state) { - return gru_prepack( + [](Conv2dOpContext::State state) { + return conv2d_clamp_prepack( std::move(std::get<0>(state)), - std::get<1>(state), - std::get<2>(state), - std::get<3>(state), - std::get<4>(state), + std::move(std::get<1>(state)), + std::move(std::get<2>(state)), + std::move(std::get<3>(state)), + std::move(std::get<4>(state)), std::get<5>(state), - std::get<6>(state)); + std::get<6>(state), + std::get<7>(state)); }); } TORCH_LIBRARY(vulkan_prepack, m) { m.def(TORCH_SELECTIVE_SCHEMA( - "vulkan_prepack::create_conv2d_clamp_context(Tensor W, Tensor? B, int[2] stride, " + "vulkan_prepack::create_conv2d_context(Tensor W, Tensor? B, int[2] stride, " "int[2] padding, int[2] dilation, int groups, " "Scalar? output_min=None, Scalar? output_max=None) " - "-> __torch__.torch.classes.vulkan.VulkanOpContext")); + "-> __torch__.torch.classes.vulkan.Conv2dPackedContext")); m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility "vulkan_prepack::conv2d_clamp_prepack(Tensor W, Tensor? 
B, int[2] stride, " "int[2] padding, int[2] dilation, int groups, " "Scalar? output_min=None, Scalar? output_max=None) " "-> __torch__.torch.classes.vulkan.Conv2dOpContext")); m.def(TORCH_SELECTIVE_SCHEMA( - "vulkan_prepack::run_conv2d_clamp_context(Tensor X, " - "__torch__.torch.classes.vulkan.VulkanOpContext W_prepack) -> Tensor Y")); + "vulkan_prepack::run_conv2d_context(Tensor X, " + "__torch__.torch.classes.vulkan.Conv2dPackedContext W_prepack) -> Tensor Y")); m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility "vulkan_prepack::conv2d_clamp_run(Tensor X, " "__torch__.torch.classes.vulkan.Conv2dOpContext W_prepack) -> Tensor Y")); m.def(TORCH_SELECTIVE_SCHEMA( - "vulkan_prepack::create_conv2d_transpose_clamp_context(Tensor W, Tensor? B, int[2] stride, " + "vulkan_prepack::create_tconv2d_context(Tensor W, Tensor? B, int[2] stride, " "int[2] padding, int[2] output_padding, int[2] dilation, int groups, " "Scalar? output_min=None, Scalar? output_max=None) " - "-> __torch__.torch.classes.vulkan.VulkanOpContext")); - m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility - "vulkan_prepack::conv2d_transpose_clamp_prepack(Tensor W, Tensor? B, int[2] stride, " - "int[2] padding, int[2] output_padding, int[2] dilation, int groups, " - "Scalar? output_min=None, Scalar? output_max=None) " - "-> __torch__.torch.classes.vulkan.TransposeConv2dOpContext")); + "-> __torch__.torch.classes.vulkan.Conv2dPackedContext")); m.def(TORCH_SELECTIVE_SCHEMA( - "vulkan_prepack::run_conv2d_transpose_clamp_context(Tensor X, " - "__torch__.torch.classes.vulkan.VulkanOpContext W_prepack) -> Tensor Y")); - m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility - "vulkan_prepack::conv2d_transpose_clamp_run(Tensor X, " - "__torch__.torch.classes.vulkan.TransposeConv2dOpContext W_prepack) -> Tensor Y")); + "vulkan_prepack::run_tconv2d_context(Tensor X, " + "__torch__.torch.classes.vulkan.Conv2dPackedContext W_prepack) -> Tensor Y")); m.def(TORCH_SELECTIVE_SCHEMA( "vulkan_prepack::create_linear_context(Tensor W, Tensor? B) " - "-> __torch__.torch.classes.vulkan.VulkanOpContext")); - m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility - "vulkan_prepack::linear_prepack(Tensor W, Tensor? 
B) " - "-> __torch__.torch.classes.vulkan.LinearOpContext")); + "-> __torch__.torch.classes.vulkan.LinearPackedContext")); m.def(TORCH_SELECTIVE_SCHEMA( "vulkan_prepack::run_linear_context(Tensor X, " - "__torch__.torch.classes.vulkan.VulkanOpContext BW_prepack) -> Tensor Y")); - m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility - "vulkan_prepack::linear_run(Tensor X, " - "__torch__.torch.classes.vulkan.LinearOpContext BW_prepack) -> Tensor Y")); + "__torch__.torch.classes.vulkan.LinearPackedContext BW_prepack) -> Tensor Y")); m.def(TORCH_SELECTIVE_SCHEMA( "vulkan_prepack::create_gru_context(Tensor[] params_cpu, " "bool has_biases, " @@ -154,24 +128,11 @@ TORCH_LIBRARY(vulkan_prepack, m) { "bool train, " "bool bidirectional, " "bool batch_first) " - "-> __torch__.torch.classes.vulkan.VulkanOpContext")); - m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility - "vulkan_prepack::gru_prepack(Tensor[] params_cpu, " - "bool has_biases, " - "int num_layers, " - "float dropout, " - "bool train, " - "bool bidirectional, " - "bool batch_first) " - "-> __torch__.torch.classes.vulkan.GruOpContext")); + "-> __torch__.torch.classes.vulkan.GruPackedContext")); m.def(TORCH_SELECTIVE_SCHEMA( "vulkan_prepack::run_gru_context(Tensor input_vk, " "Tensor hx_vk, " - "__torch__.torch.classes.vulkan.VulkanOpContext G_prepack) -> (Tensor next_input, Tensor hidden_layer)")); - m.def(TORCH_SELECTIVE_SCHEMA( // Backwards compatibility - "vulkan_prepack::gru_run(Tensor input_vk, " - "Tensor hx_vk, " - "__torch__.torch.classes.vulkan.GruOpContext G_prepack) -> (Tensor next_input, Tensor hidden_layer)")); + "__torch__.torch.classes.vulkan.GruPackedContext G_prepack) -> (Tensor next_input, Tensor hidden_layer)")); m.def(TORCH_SELECTIVE_SCHEMA( "vulkan_prepack::create_lstm_context(Tensor[] params_cpu, " "bool has_biases, " @@ -180,40 +141,30 @@ TORCH_LIBRARY(vulkan_prepack, m) { "bool train, " "bool bidirectional, " "bool batch_first) " - "-> __torch__.torch.classes.vulkan.VulkanOpContext")); + "-> __torch__.torch.classes.vulkan.LstmPackedContext")); m.def(TORCH_SELECTIVE_SCHEMA( "vulkan_prepack::run_lstm_context(Tensor input_vk, " "Tensor hx_vk, " "Tensor cx_vk, " - "__torch__.torch.classes.vulkan.VulkanOpContext L_prepack) -> (Tensor next_input, Tensor hidden_state, Tensor cell_state)")); + "__torch__.torch.classes.vulkan.LstmPackedContext L_prepack) -> (Tensor next_input, Tensor hidden_state, Tensor cell_state)")); } TORCH_LIBRARY_IMPL(vulkan_prepack, CPU, m) { m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::create_conv2d_clamp_context"), - TORCH_FN(create_conv2d_clamp_context)); + TORCH_SELECTIVE_NAME("vulkan_prepack::create_conv2d_context"), + TORCH_FN(create_conv2d_context)); m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::conv2d_clamp_prepack"), TORCH_FN(conv2d_clamp_prepack)); // Backwards compatibility m.impl( - TORCH_SELECTIVE_NAME( - "vulkan_prepack::create_conv2d_transpose_clamp_context"), - TORCH_FN(create_conv2d_transpose_clamp_context)); - m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::conv2d_transpose_clamp_prepack"), - TORCH_FN(conv2d_transpose_clamp_prepack)); // Backwards compatibility + TORCH_SELECTIVE_NAME("vulkan_prepack::create_tconv2d_context"), + TORCH_FN(create_tconv2d_context)); m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::create_linear_context"), TORCH_FN(create_linear_context)); - m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::linear_prepack"), - TORCH_FN(linear_prepack)); // Backwards compatibility m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::create_gru_context"), 
TORCH_FN(create_gru_context)); - m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::gru_prepack"), - TORCH_FN(gru_prepack)); // Backwards compatibility m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::create_lstm_context"), TORCH_FN(create_lstm_context)); @@ -221,161 +172,26 @@ TORCH_LIBRARY_IMPL(vulkan_prepack, CPU, m) { TORCH_LIBRARY_IMPL(vulkan_prepack, Vulkan, m) { m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::run_conv2d_clamp_context"), - TORCH_FN(run_conv2d_clamp_context)); + TORCH_SELECTIVE_NAME("vulkan_prepack::run_conv2d_context"), + TORCH_FN(run_conv2d_context)); m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::conv2d_clamp_run"), TORCH_FN(conv2d_clamp_run)); // Backwards compatibility m.impl( - TORCH_SELECTIVE_NAME( - "vulkan_prepack::run_conv2d_transpose_clamp_context"), - TORCH_FN(run_conv2d_transpose_clamp_context)); - m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::conv2d_transpose_clamp_run"), - TORCH_FN(conv2d_transpose_clamp_run)); // Backwards compatibility + TORCH_SELECTIVE_NAME("vulkan_prepack::run_tconv2d_context"), + TORCH_FN(run_tconv2d_context)); m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::run_linear_context"), TORCH_FN(run_linear_context)); - m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::linear_run"), - TORCH_FN(linear_run)); // Backwards compatibility m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::run_gru_context"), TORCH_FN(run_gru_context)); - m.impl( - TORCH_SELECTIVE_NAME("vulkan_prepack::gru_run"), - TORCH_FN(gru_run)); // Backwards compatibility m.impl( TORCH_SELECTIVE_NAME("vulkan_prepack::run_lstm_context"), TORCH_FN(run_lstm_context)); } -Tensor convolution( - const Tensor& input, - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride, - const IntArrayRef padding, - const IntArrayRef dilation, - const bool transposed, - const IntArrayRef output_padding, - const int64_t groups) { - if (transposed) { - VulkanOpContext vulkan_context = conv2d_transpose_context_create( - weight, bias, stride, padding, output_padding, dilation, groups); - return conv2d_transpose_context_run( - input, vulkan_context.get_packed(), vulkan_context.get_unpacked()); - } - VulkanOpContext vulkan_context = conv2d_context_create( - weight, - bias, - stride, - padding, - dilation, - transposed, - output_padding, - groups); - return conv2d_context_run( - input, vulkan_context.get_packed(), vulkan_context.get_unpacked()); -} - -Tensor quantized_convolution( - const Tensor& input, - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride, - const IntArrayRef padding, - const IntArrayRef dilation, - const bool transposed, - const IntArrayRef output_padding, - const int64_t groups, - const double out_scale, - const int64_t out_zero_point) { - if (transposed) { - VulkanOpContext vulkan_context = conv2d_transpose_context_create( - weight, bias, stride, padding, output_padding, dilation, groups); - return conv2d_transpose_context_run( - input, vulkan_context.get_packed(), vulkan_context.get_unpacked()); - } - VulkanOpContext vulkan_context = conv2d_context_create_q( - weight, - bias, - stride, - padding, - dilation, - transposed, - output_padding, - groups, - c10::nullopt, - c10::nullopt); - return conv2d_context_run_q( - input, - vulkan_context.get_packed(), - vulkan_context.get_unpacked(), - out_scale, - out_zero_point); -} } // namespace - -static std::tuple batchify( - const Tensor& input, - const int64_t num_spatial_dims, - const std::string& func_name) { - const auto dim_count_no_batch = num_spatial_dims + 1; - const auto dim_count_batch = 
dim_count_no_batch + 1; - const auto is_batched = (input.dim() == dim_count_batch); - TORCH_CHECK( - input.dim() == dim_count_no_batch || is_batched, - "Expected ", - dim_count_no_batch, - "D (unbatched) or ", - dim_count_batch, - "D (batched) input to ", - func_name, - ", but got input of size: ", - input.sizes()); - return std::make_tuple(is_batched ? input : input.unsqueeze(0), is_batched); -} - -Tensor conv2d( - const Tensor& input_, - const Tensor& weight, - const c10::optional& bias_opt, - IntArrayRef stride, - IntArrayRef padding, - IntArrayRef dilation, - int64_t groups, - double out_scale, - int64_t out_zero_point) { - // See [Note: hacky wrapper removal for optional tensor] - c10::MaybeOwned bias_maybe_owned = - at::borrow_from_optional_tensor(bias_opt); - const Tensor& bias = *bias_maybe_owned; - - Tensor input; - bool is_batched; - std::tie(input, is_batched) = - batchify(input_, /*num_spatial_dims=*/2, "conv2d"); - Tensor output; - output = quantized_convolution( - input, - weight, - bias, - stride, - padding, - dilation, - false, - {{0, 0}}, - groups, - out_scale, - out_zero_point); - return is_batched ? output : output.squeeze(0); -} - -TORCH_LIBRARY_IMPL(aten, Vulkan, m) { - m.impl("convolution_overrideable", convolution); -} - } // namespace ops } // namespace vulkan } // namespace native diff --git a/aten/src/ATen/native/vulkan/ops/Shape.cpp b/aten/src/ATen/native/vulkan/ops/Shape.cpp index f5c47187c3bda..14b32c2eea179 100644 --- a/aten/src/ATen/native/vulkan/ops/Shape.cpp +++ b/aten/src/ATen/native/vulkan/ops/Shape.cpp @@ -22,7 +22,7 @@ Tensor view_internal(const Tensor& self_arg, const IntArrayRef shape) { self.options(), }; - api::StagingBuffer buffer(context, v_self.buffer_bytes(), true); + api::StorageBuffer buffer(context, at::kFloat, v_self.numcells(), true); utils::pack_vtensor_to_staging(v_self, buffer.buffer()); diff --git a/aten/src/ATen/native/vulkan/ops/Slice.cpp b/aten/src/ATen/native/vulkan/ops/Slice.cpp index d45bff6af4066..a6c0beb965b42 100644 --- a/aten/src/ATen/native/vulkan/ops/Slice.cpp +++ b/aten/src/ATen/native/vulkan/ops/Slice.cpp @@ -95,7 +95,7 @@ Tensor slice_width( api::PipelineBarrier pipeline_barrier{}; - context->submit_texture_copy( + context->submit_copy( // pipeline barrier pipeline_barrier, // images @@ -126,7 +126,7 @@ Tensor slice_width( api::PipelineBarrier pipeline_barrier{}; - context->submit_texture_copy( + context->submit_copy( // pipeline barrier pipeline_barrier, // images @@ -171,7 +171,7 @@ Tensor slice_height( api::PipelineBarrier pipeline_barrier{}; - context->submit_texture_copy( + context->submit_copy( // pipeline barrier pipeline_barrier, // images @@ -200,7 +200,7 @@ Tensor slice_height( api::PipelineBarrier pipeline_barrier{}; - context->submit_texture_copy( + context->submit_copy( // pipeline barrier pipeline_barrier, // images diff --git a/aten/src/ATen/native/vulkan/ops/Tensor.h b/aten/src/ATen/native/vulkan/ops/Tensor.h index ecf99ceb9f375..16dbc9887355c 100644 --- a/aten/src/ATen/native/vulkan/ops/Tensor.h +++ b/aten/src/ATen/native/vulkan/ops/Tensor.h @@ -80,6 +80,11 @@ class vTensorStorage final { // Validation void verify() const; + + public: + inline VkFormat texture_format() { + return image_.format(); + } }; class vTensor final { @@ -139,6 +144,13 @@ class vTensor final { return view_->extents_; } + /* + * Get a c10::ScalarType that corresponds to the image format of the texture + */ + inline c10::ScalarType texture_dtype() const { + return api::c10_scalartype(view_->texture_format()); + } + inline const 
TensorOptions& options() const { return view_->options_; } @@ -168,11 +180,29 @@ class vTensor final { c10::multiply_integers(sizes()); } - inline VkDeviceSize buffer_bytes() { - return c10::elementSize(c10::typeMetaToScalarType(options().dtype())) * - view_->extents_.data[0u] * view_->extents_.data[1u] * + /* + * Number of texels in the image texture. + */ + inline VkDeviceSize numtexels() { + return view_->extents_.data[0u] * view_->extents_.data[1u] * + view_->extents_.data[2u]; + } + + /* + * Number of "cells" in the image texture. 4 cells make up a texel. + */ + inline VkDeviceSize numcells() { + return view_->extents_.data[0u] * view_->extents_.data[1u] * (4u * view_->extents_.data[2u]); } + + /* + * Number of bytes needed for a buffer to receive all data in the texture + */ + inline VkDeviceSize buffer_bytes() { + return c10::elementSize(this->texture_dtype()) * view_->extents_.data[0u] * + view_->extents_.data[1u] * (4u * view_->extents_.data[2u]); + } }; void add_buffer_barrier( diff --git a/aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.cpp b/aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.cpp deleted file mode 100644 index 125efa803f3c1..0000000000000 --- a/aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.cpp +++ /dev/null @@ -1,600 +0,0 @@ -#include -#include - -#include -#include -#include -#include -#include -#include - -namespace at { -namespace native { -namespace vulkan { -namespace ops { -namespace { - -using namespace api::utils; -using namespace at::native::vulkan::ops; - -vTensor pack_weights_2d_reverse( - api::Context* const context, - const Tensor& weight, - bool reversed) { - /* Source */ - const IntArrayRef src_filter = weight.sizes(); - const float* const src_weight_ptr = weight.data_ptr(); - - const int64_t src_kw_sz = src_filter[Layout::Filter::width]; - const int64_t src_kh_sz = src_filter[Layout::Filter::height]; - const int64_t src_kernel_sz = src_kw_sz * src_kh_sz; - const int64_t src_block_sz = - src_kernel_sz * src_filter[Layout::Filter::input]; - - const int64_t num_stacks = - div_up(src_filter[Layout::Filter::output], INT64_C(4)); - const int64_t stack_depth = - api::utils::align_up(src_filter[Layout::Filter::input], INT64_C(4)); - - /* Destination */ - const int64_t dst_kw_sz = src_kw_sz * stack_depth; - const int64_t dst_kh_sz = src_kh_sz * num_stacks; - const int64_t dst_kernel_sz = dst_kw_sz * dst_kh_sz; - - vTensor v_weight{ - context, - { - 4, - dst_kh_sz, - dst_kw_sz, - }, - weight.options(), - }; - - api::StagingBuffer staging(context, v_weight.buffer_bytes()); - { - api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); - - float* dst_weight_ptr = mapping.template data(); - - memset(dst_weight_ptr, 0, v_weight.nbytes()); - - for (const auto src_oc : c10::irange(src_filter[Layout::Filter::output])) { - /* Source */ - const float* const src_weight_oc_ptr = - src_weight_ptr + src_oc * src_block_sz; - - /* Destination */ - const int64_t dst_oh = src_oc / 4; - const int64_t dst_c = src_oc % 4; - - float* const dst_weight_c_ptr = dst_weight_ptr + dst_c * dst_kernel_sz; - - for (const auto src_ic : c10::irange(src_filter[Layout::Filter::input])) { - for (const auto src_ih : c10::irange(src_kh_sz)) { - const int64_t dst_h = reversed ? (src_kh_sz - 1 - src_ih) : src_ih; - for (const auto src_iw : c10::irange(src_kw_sz)) { - const int64_t dst_w = reversed ? 
(src_kw_sz - 1 - src_iw) : src_iw; - const int64_t dst_w_offset = dst_w * stack_depth; - memcpy( - dst_weight_c_ptr + (dst_oh * src_kh_sz + dst_h) * dst_kw_sz + - src_ic + dst_w_offset, - src_weight_oc_ptr + src_ic * src_kernel_sz + - src_ih * src_kw_sz + src_iw, - sizeof(float)); - } - } - } - } - } - utils::pack_staging_to_vtensor(staging.buffer(), v_weight); - - return v_weight; -} - -vTensor pack_weights(const Tensor& weight_arg) { - if (weight_arg.is_vulkan()) { - return convert(weight_arg); - } - - api::Context* const context = api::context(); - - const Tensor weight = at::permute(weight_arg, {1, 0, 2, 3}).contiguous(); - - return pack_weights_2d_reverse(context, weight, true); -} - -vTensor pack_biases(const c10::optional& bias, const Tensor& weight) { - if (bias && bias->is_vulkan()) { - return convert(*bias); - } - - api::Context* const context = api::context(); - - const int64_t src_w = weight.size(Layout::TransposedFilter::output); - const int64_t packed_w = div_up(src_w, INT64_C(4)); - vTensor v_bias{ - context, - { - 4, - 1, - packed_w, - }, - weight.options(), - }; - - api::StagingBuffer staging(context, v_bias.buffer_bytes()); - { - api::MemoryMap mapping(staging.buffer(), api::MemoryAccessType::WRITE); - - float* dst_bias_ptr = mapping.template data(); - - if (bias) { - const float* const src_bias_ptr = bias->contiguous().data_ptr(); - - memset(dst_bias_ptr, 0, v_bias.nbytes()); - for (const auto i : c10::irange(src_w)) { - const int64_t c = i % 4; - const int64_t x = i / 4; - dst_bias_ptr[c * packed_w + x] = src_bias_ptr[i]; - } - } else { - memset( - dst_bias_ptr, - // 2's complement integers and IEEE-754 floating point numbers both - // have identical bit representations for 0, so can use memset which - // only accepts uint8_t parameter. - 0, - v_bias.nbytes()); - } - } - utils::pack_staging_to_vtensor(staging.buffer(), v_bias); - - return v_bias; -} - -std::array pack_filter( - const Tensor& weight, - const IntArrayRef dilation) { - const IntArrayRef filter = weight.sizes(); - - const auto effective = [](const int64_t k, const int64_t d) { - return k + (k - 1) * (d - 1); - }; - - return { - align_up(filter[Layout::TransposedFilter::output], INT64_C(4)), - align_up(filter[Layout::TransposedFilter::input], INT64_C(4)), - effective( - filter[Layout::Filter::height], dilation[Layout::Parameter::height]), - effective( - filter[Layout::Filter::width], dilation[Layout::Parameter::width]), - }; -} - -std::array pack_params(const std::vector& vector) { - TORCH_INTERNAL_ASSERT(2u == vector.size(), "Invalid usage!"); - - return { - vector[0], - vector[1], - }; -} - -bool available( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride, - const IntArrayRef padding, - const IntArrayRef /* output_padding */, - const IntArrayRef dilation, - const bool transposed, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - return api::available() && - // Weight - (4 == weight.ndimension()) && (weight.size(Layout::Filter::height) > 0) && - (weight.size(Layout::Filter::width) > 0) && - ((weight.device().is_cpu()) || - (c10::DeviceType::Vulkan == weight.device().type())) && - (kFloat == weight.scalar_type()) && - // Bias - ((bias && bias->defined()) - ? ((1 == bias->ndimension()) && - ((bias->device().is_cpu()) || - (c10::DeviceType::Vulkan == bias->device().type())) && - (kFloat == bias->scalar_type()) && - (transposed ? 
(weight.size(Layout::TransposedFilter::output) == - bias->size(Layout::Filter::output)) - : (weight.size(Layout::Filter::output) == - bias->size(Layout::Filter::output)))) - : true) && - // Stride - (stride[Layout::Parameter::height] > 0) && - (stride[Layout::Parameter::width] > 0) && - // Padding - (padding[Layout::Parameter::height] >= 0) && - (padding[Layout::Parameter::width] >= 0) && - // Dilation - (transposed ? (dilation[Layout::Parameter::height] == 1) && - (dilation[Layout::Parameter::width] == 1) - : (dilation[Layout::Parameter::height] > 0) && - (dilation[Layout::Parameter::width] > 0)) && - // Groups - (groups > 0) && - // Input - (weight.size(Layout::Filter::input) > 0) && - // Output - (weight.size(Layout::Filter::output) > 0) && - // Output - Groups - ((weight.size(Layout::Filter::output) % groups) == 0) && - // Output Min / Max - (!output_min || output_min->isFloatingPoint()) && - (!output_max || output_max->isFloatingPoint()) && true; -} - -bool usable(const Tensor& input) { - // Input - return (4 == input.ndimension()) && - (c10::DeviceType::Vulkan == input.device().type()) && - (kFloat == input.scalar_type()) && - (input.size(Layout::Activation4D::batch) >= 0) && - (input.size(Layout::Activation4D::channels) > 0) && - (input.size(Layout::Activation4D::height) > 0) && - (input.size(Layout::Activation4D::width) > 0) && !input.requires_grad() && - true; -} - -static inline std::vector get_conv_transpose_output_size( - IntArrayRef input_size, - IntArrayRef weight_size, - IntArrayRef padding, - IntArrayRef output_padding, - IntArrayRef stride, - IntArrayRef dilation = IntArrayRef()) { - auto dim = input_size.size(); - std::vector output_size(dim); - output_size[0] = input_size[input_batch_size_dim]; - output_size[1] = weight_size[weight_input_channels_dim]; - for (const auto d : c10::irange(2, dim)) { - output_size[d] = stride[d - 2] * (input_size[d] - 1) + weight_size[d] - - 2 * padding[d - 2] + output_padding[d - 2]; - } - return output_size; -} - -} // namespace - -VulkanOpContext conv2d_transpose_context_create( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride_arg, - const IntArrayRef padding_arg, - const IntArrayRef output_padding_arg, - const IntArrayRef dilation_arg, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - const auto stride = expand_param_if_needed(stride_arg, "stride", 2); - const auto padding = expand_param_if_needed(padding_arg, "padding", 2); - const auto dilation = expand_param_if_needed(dilation_arg, "dilation", 2); - const auto output_padding = - expand_param_if_needed(output_padding_arg, "output_padding", 2); - - TORCH_CHECK( - available( - weight, - bias, - stride, - padding, - output_padding, - dilation, - true, - groups, - output_min, - output_max), - "Vulkan::convolution not available! 
" - "Reason: The provided (weight, bias, stride, padding, dilation, groups, " - "transposed, output_padding, output_min, output_max) parameters are either " - "invalid individually or their combination is not supported by Vulkan impl."); - - c10::impl::GenericList packed_context{c10::AnyType::get()}; - packed_context.reserve(10); - packed_context.emplace_back(convert(pack_weights(weight))); - packed_context.emplace_back(convert(pack_biases(bias, weight))); - packed_context.emplace_back(pack_filter(weight, dilation)); - packed_context.emplace_back(pack_params(stride)); - packed_context.emplace_back(pack_params(padding)); - packed_context.emplace_back(pack_params(output_padding)); - packed_context.emplace_back(pack_params(dilation)); - packed_context.emplace_back(safe_downcast(groups)); - packed_context.emplace_back( - output_min ? output_min->template to() - : -std::numeric_limits::infinity()); - packed_context.emplace_back( - output_max ? output_max->template to() - : +std::numeric_limits::infinity()); - - c10::impl::GenericList unpacked_context{c10::AnyType::get()}; - unpacked_context.reserve(10); - unpacked_context.emplace_back(weight); - unpacked_context.emplace_back(bias); - unpacked_context.emplace_back(weight.sizes().vec()); - unpacked_context.emplace_back(stride_arg.vec()); - unpacked_context.emplace_back(padding_arg.vec()); - unpacked_context.emplace_back(output_padding_arg.vec()); - unpacked_context.emplace_back(dilation_arg.vec()); - unpacked_context.emplace_back(groups); - unpacked_context.emplace_back(output_min); - unpacked_context.emplace_back(output_max); - - return VulkanOpContext::create(packed_context, unpacked_context); -} - -void conv2d_transpose_sliding_window( - const api::ShaderSource& shader, - vTensor& v_output, - const vTensor& v_input, - const vTensor& packed_v_weight, - const vTensor& packed_v_bias, - const IntArrayRef packed_filter, - const IntArrayRef packed_stride, - const IntArrayRef packed_padding, - const IntArrayRef packed_dilation, - const float packed_output_min, - const float packed_output_max, - const IntArrayRef unpacked_filter) { - api::Context* const context = api::context(); - - const struct Block final { - uvec3 extents; - int32_t ic4; - ivec4 kernel; - ivec2 ikernel; - ivec2 stride; - ivec2 padding; - ivec2 dilate; - vec2 clamp; - ivec4 src_filter; - } block{ - v_output.extents(), - safe_downcast( - packed_filter[Layout::Filter::input]), /* this is aligned up */ - { - safe_downcast(packed_filter[Layout::Filter::width]), - safe_downcast(packed_filter[Layout::Filter::height]), - safe_downcast(v_input.sizes()[Layout::Activation4D::width]), - safe_downcast(v_input.sizes()[Layout::Activation4D::height]), - }, - { - safe_downcast(unpacked_filter[Layout::Filter::width]), - safe_downcast(unpacked_filter[Layout::Filter::height]), - }, - { - safe_downcast(packed_stride[Layout::Parameter::width]), - safe_downcast(packed_stride[Layout::Parameter::height]), - }, - { - safe_downcast(packed_padding[Layout::Parameter::width]), - safe_downcast(packed_padding[Layout::Parameter::height]), - }, - { - safe_downcast(packed_dilation[Layout::Parameter::width]), - safe_downcast(packed_dilation[Layout::Parameter::height]), - }, - { - packed_output_min, - packed_output_max, - }, - }; - - uvec3 global_size = v_output.extents(); - - api::UniformParamsBuffer params(context, block); - api::PipelineBarrier pipeline_barrier{}; - - context->submit_compute_job( - // shader descriptor - shader, - // pipeline barrier - pipeline_barrier, - // global work group size - global_size, - 
// local work group size - adaptive_work_group_size(global_size), - // fence handle - VK_NULL_HANDLE, - // shader arguments - v_output.image( - pipeline_barrier, - api::PipelineStage::COMPUTE, - api::MemoryAccessType::WRITE), - v_input.image(pipeline_barrier, api::PipelineStage::COMPUTE), - packed_v_weight.image(pipeline_barrier, api::PipelineStage::COMPUTE), - packed_v_bias.image(pipeline_barrier, api::PipelineStage::COMPUTE), - // params buffer - params.buffer()); -} - -Tensor conv2d_transpose_context_run( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context) { - api::Context* const context = api::context(); - - const Tensor input = input_arg.is_vulkan() ? input_arg : input_arg.vulkan(); - const vTensor& v_input = convert(input); - - const vTensor& packed_v_weight = convert(packed_context.get(0).toTensor()); - const vTensor& packed_v_bias = convert(packed_context.get(1).toTensor()); - - const auto packed_filter = packed_context.get(2).toIntVector(); - const auto packed_stride = packed_context.get(3).toIntVector(); - const auto packed_padding = packed_context.get(4).toIntVector(); - const auto packed_output_padding = packed_context.get(5).toIntVector(); - const auto packed_dilation = packed_context.get(6).toIntVector(); - const float packed_output_min = packed_context.get(8).toDouble(); - const float packed_output_max = packed_context.get(9).toDouble(); - const auto unpacked_filter = unpacked_context.get(2).toIntVector(); - - TORCH_CHECK( - usable(input), - "Vulkan Convolution not usable! " - "Reason: The provided input tensor is either invalid or unsupported by Vulkan impl."); - - vTensor v_output{ - context, - get_conv_transpose_output_size( - v_input.sizes(), - unpacked_filter, - packed_padding, - packed_output_padding, - packed_stride, - packed_dilation), - input.options(), - }; - - conv2d_transpose_sliding_window( - VK_KERNEL(conv_transpose2d), - v_output, - v_input, - packed_v_weight, - packed_v_bias, - packed_filter, - packed_stride, - packed_padding, - packed_dilation, - packed_output_min, - packed_output_max, - unpacked_filter); - - return convert(v_output); -} - -c10::intrusive_ptr create_conv2d_transpose_clamp_context( - Tensor&& weight, - c10::optional&& bias, - std::vector&& stride, - std::vector&& padding, - std::vector&& output_padding, - std::vector&& dilation, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - return c10::make_intrusive(conv2d_transpose_context_create( - weight, - bias, - stride, - padding, - output_padding, - dilation, - groups, - output_min, - output_max)); -} - -Tensor run_conv2d_transpose_clamp_context( - const Tensor& input, - const c10::intrusive_ptr& vulkan_context) { - return conv2d_transpose_context_run( - input, vulkan_context->get_packed(), vulkan_context->get_unpacked()); -} - -/* Backwards compatibility */ -TransposeConv2dOpContext::TransposeConv2dOpContext( - VulkanOpContext vulkan_context) - : vulkan_context_{std::move(vulkan_context)} {} - -TransposeConv2dOpContext TransposeConv2dOpContext::create( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride_arg, - const IntArrayRef padding_arg, - const IntArrayRef output_padding_arg, - const IntArrayRef dilation_arg, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - return TransposeConv2dOpContext{conv2d_transpose_context_create( - weight, - bias, - stride_arg, - padding_arg, - output_padding_arg, - 
dilation_arg, - groups, - output_min, - output_max)}; -} - -Tensor TransposeConv2dOpContext::run(const Tensor& input_arg) const { - return conv2d_transpose_context_run( - input_arg, vulkan_context_.get_packed(), vulkan_context_.get_unpacked()); -} - -TransposeConv2dOpContext::State TransposeConv2dOpContext::unpack() const { - const c10::impl::GenericList unpacked_ = - std::get<1>(vulkan_context_.get_state()); - const Tensor unpacked_weight = unpacked_.get(0).toTensor(); - const c10::optional unpacked_bias = unpacked_.get(1).isTensor() - ? unpacked_.get(1).toTensor() - : (c10::optional&)c10::nullopt; - const std::vector unpacked_stride = unpacked_.get(3).toIntVector(); - const std::vector unpacked_padding = unpacked_.get(4).toIntVector(); - const std::vector unpacked_output_padding = - unpacked_.get(5).toIntVector(); - const std::vector unpacked_dilation = unpacked_.get(6).toIntVector(); - const int64_t unpacked_groups = unpacked_.get(7).toInt(); - const c10::optional unpacked_output_min = unpacked_.get(6).isScalar() - ? unpacked_.get(8).toScalar() - : (c10::optional)c10::nullopt; - const c10::optional unpacked_output_max = unpacked_.get(6).isScalar() - ? unpacked_.get(9).toScalar() - : (c10::optional)c10::nullopt; - return TransposeConv2dOpContext::State{ - unpacked_weight, - unpacked_bias, - unpacked_stride, - unpacked_padding, - unpacked_output_padding, - unpacked_dilation, - unpacked_groups, - unpacked_output_min, - unpacked_output_max, - }; -} - -c10::intrusive_ptr conv2d_transpose_clamp_prepack( - Tensor&& weight, - c10::optional&& bias, - std::vector&& stride, - std::vector&& padding, - std::vector&& output_padding, - std::vector&& dilation, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max) { - return c10::make_intrusive( - TransposeConv2dOpContext::create( - std::move(weight), - std::move(bias), - std::move(stride), - std::move(padding), - std::move(output_padding), - std::move(dilation), - groups, - output_min, - output_max)); -} - -Tensor conv2d_transpose_clamp_run( - const Tensor& input, - const c10::intrusive_ptr& context) { - return context->run(input); -} - -} // namespace ops -} // namespace vulkan -} // namespace native -} // namespace at diff --git a/aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.h b/aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.h deleted file mode 100644 index b440e243b57d0..0000000000000 --- a/aten/src/ATen/native/vulkan/ops/TransposeConvolution2d.h +++ /dev/null @@ -1,125 +0,0 @@ -#pragma once - -#ifdef USE_VULKAN_API - -#include -#include - -namespace at { -namespace native { -namespace vulkan { -namespace ops { - -enum TransposeConv2dMethod { - TransposeConv2dSlidingWindow, -}; - -// packed -// vTensor v_weight -// vTensor v_bias -// std::array filter -// std::array stride -// std::array padding -// std::array output_padding -// std::array dilation -// int32_t groups -// float output_min -// float output_max - -// unpacked -// Tensor weight -// c10::optional bias -// std::vector filter -// std::vector stride -// std::vector padding -// std::vector output_padding -// std::vector dilation -// int64_t groups -// c10::optional output_min -// c10::optional output_max - -Tensor conv2d_transpose_context_run( - const Tensor& input_arg, - const c10::impl::GenericList& packed_context, - const c10::impl::GenericList& unpacked_context); - -VulkanOpContext conv2d_transpose_context_create( - const Tensor& weight, - const c10::optional& bias, - const IntArrayRef stride_arg, - const IntArrayRef padding_arg, 
- const IntArrayRef output_padding_arg, - const IntArrayRef dilation_arg, - const int64_t groups, - const c10::optional& output_min = c10::nullopt, - const c10::optional& output_max = c10::nullopt); - -Tensor run_conv2d_transpose_clamp_context( - const Tensor& input, - const c10::intrusive_ptr& context); - -c10::intrusive_ptr create_conv2d_transpose_clamp_context( - Tensor&& weight, - c10::optional&& bias, - std::vector&& stride, - std::vector&& padding, - std::vector&& output_padding, - std::vector&& dilation, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max); - -// Backwards compatibility -class TransposeConv2dOpContext final : public torch::jit::CustomClassHolder { - public: - static TransposeConv2dOpContext create( - const Tensor& weight, - const c10::optional& bias, - IntArrayRef stride, - IntArrayRef padding, - IntArrayRef output_padding, - IntArrayRef dilation, - int64_t groups, - const c10::optional& output_min = c10::nullopt, - const c10::optional& output_max = c10::nullopt); - - using State = std::tuple< - Tensor, - c10::optional, - std::vector, - std::vector, - std::vector, - std::vector, - int64_t, - c10::optional, - c10::optional>; - - Tensor run(const Tensor& input) const; - State unpack() const; - - private: - explicit TransposeConv2dOpContext(VulkanOpContext vulkan_context); - VulkanOpContext vulkan_context_; -}; - -Tensor conv2d_transpose_clamp_run( - const Tensor& input, - const c10::intrusive_ptr& context); - -c10::intrusive_ptr conv2d_transpose_clamp_prepack( - Tensor&& weight, - c10::optional&& bias, - std::vector&& stride, - std::vector&& padding, - std::vector&& output_padding, - std::vector&& dilation, - const int64_t groups, - const c10::optional& output_min, - const c10::optional& output_max); - -} // namespace ops -} // namespace vulkan -} // namespace native -} // namespace at - -#endif /* USE_VULKAN_API */ diff --git a/aten/src/ATen/native/vulkan/ops/Utils.cpp b/aten/src/ATen/native/vulkan/ops/Utils.cpp index 0a255f9915bd3..0ad893db11a7d 100644 --- a/aten/src/ATen/native/vulkan/ops/Utils.cpp +++ b/aten/src/ATen/native/vulkan/ops/Utils.cpp @@ -6,6 +6,156 @@ namespace vulkan { namespace ops { namespace utils { +/* + * This function formats an input tensor in NCHW layout to NC4HW layout such + * that the buffer of the formatted tensor can be directly copied into a GPU + * texture. Conceptually, the formatting can be achieved via the following + * steps: + * + * 1. Given that the src tensor has size {N,C,H,W} + * + * 2. Combine the batch and channel dims by reshaping to {N*C, H, W} + * + * 3. Determine the amount of padding to add: determine how many channels to add + * in order to align N*C to the next multiple of 4 + * + * 4. Add padding to the tensor so that the batch-channel dimension is a + * multiple of four; the shape of the tensor is now {NC_aligned, H, W} + * + * 5. Split the batch-channel dimension into groups of 4 by reshaping the tensor + * to size {NC_aligned/4, 4, H, W} + * + * 6. The groups of 4 channels (dim 1) should be contiguous. Therefore, permute + * the dims of the tensor in the order {0, 2, 3, 1} + * + * 7. Finally, return a contiguous version of the tensor. 
The final shape of the
+ * tensor would be {NC_aligned/4, H, W, 4}
+ */
+Tensor nchw_to_nc4hw(const Tensor& src) {
+  uint32_t N = get_dim(src.sizes());
+  uint32_t C = get_dim(src.sizes());
+  uint32_t H = get_dim(src.sizes());
+  uint32_t W = get_dim(src.sizes());
+
+  uint32_t NC4 = api::utils::div_up(N * C, 4u);
+  uint32_t NC_aligned = api::utils::align_up(N * C, 4u);
+
+  // Add padding to the tensor so that the batch-channel dim is a multiple of 4
+  Tensor padding = at::zeros({NC_aligned - N * C, H, W}, src.options());
+  Tensor src_padded = at::cat({src.reshape({N * C, H, W}), padding});
+  // Reshape to group channels into groups of 4 and permute so that the groups
+  // are in the first dimension so that they are contiguous
+  Tensor src_NC4HW = src_padded.reshape({NC4, 4, H, W}).permute({0, 2, 3, 1});
+
+  // Return a contiguous version of the tensor
+  return src_NC4HW.contiguous();
+}
+
+/*
+ * Creates a staging tensor into which texture data, which will be in NC4HW
+ * format, can be copied directly. The shape of the staging tensor will be the
+ * same as the tensor produced by a call to nchw_to_nc4hw().
+ */
+Tensor create_staging_tensor(const vTensor& v_in) {
+  uint32_t N = get_dim(v_in.sizes());
+  uint32_t C = get_dim(v_in.sizes());
+  uint32_t H = get_dim(v_in.sizes());
+  uint32_t W = get_dim(v_in.sizes());
+
+  uint32_t NC4 = api::utils::div_up(N * C, 4u);
+
+  // Note that the dtype corresponding to the texture format of the vTensor is
+  // used instead of options().dtype(). This is to ensure the number of bytes in
+  // the staging tensor matches the number of bytes in the image texture. Refer
+  // to comments for api::vk_format()
+  return at::empty(
+      {NC4, H, W, 4}, at::device(at::kCPU).dtype(v_in.texture_dtype()));
+}
+
+/*
+ * After copying texture data, which will be in NC4HW format, to a staging
+ * tensor created in create_staging_tensor(), this function reformats the tensor
+ * to NCHW format. It essentially reverses the transformations made by
+ * nchw_to_nc4hw().
+ *
+ * Note that the sizes of the original tensor must be passed in to fully restore
+ * the properties of the original tensor.
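+ *
+ * A minimal round-trip sketch (the shapes below are illustrative, not taken
+ * from this patch), using only the two helpers defined in this file:
+ *
+ *   at::Tensor t = at::rand({1, 6, 2, 2});      // NCHW, so N*C == 6
+ *   at::Tensor packed = nchw_to_nc4hw(t);       // {2, 2, 2, 4}; two zero
+ *                                               // channels pad N*C up to 8
+ *   at::Tensor restored = nc4hw_to_nchw(packed, t.sizes());
+ *   // restored should compare equal to t: the padding channels are narrowed
+ *   // away and the original sizes are restored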
+ */ +Tensor nc4hw_to_nchw(const Tensor& t_in, IntArrayRef sizes) { + uint32_t N = get_dim(sizes); + uint32_t C = get_dim(sizes); + uint32_t H = get_dim(sizes); + uint32_t W = get_dim(sizes); + + uint32_t NC_aligned = api::utils::align_up(N * C, 4u); + + // Undo the permute step and channel grouping step + Tensor t_in_padded = t_in.permute({0, 3, 1, 2}).reshape({NC_aligned, H, W}); + // Remove the padding channels + Tensor t_in_shaved = + at::narrow(t_in_padded, /*dim=*/0, /*start*/ 0, /*end*/ N * C); + + // Reshape to original sizing and dtype and return a contiguous Tensor + return t_in_shaved.reshape(sizes).contiguous(); +} + +void copy_buffer_to_vtensor( + api::VulkanBuffer& src_buffer, + vTensor& v_dst, + api::PipelineBarrier& pipeline_barrier) { + api::Context* const context = api::context(); + + TORCH_CHECK( + src_buffer.mem_size() == v_dst.buffer_bytes(), + "Vulkan copy_buffer_to_vtensor: source buffer and destination texture " + "do not have the same number of bytes"); + + context->submit_copy( + // pipeline barrier + pipeline_barrier, + // resources + src_buffer, + v_dst.image( + pipeline_barrier, + api::PipelineStage::TRANSFER, + api::MemoryAccessType::WRITE), + // copy details + v_dst.extents(), + {0u, 0u, 0u}, + {0u, 0u, 0u}, + // fence handle + VK_NULL_HANDLE); +} + +void copy_vtensor_to_buffer( + vTensor& v_src, + api::VulkanBuffer& dst_buffer, + api::PipelineBarrier& pipeline_barrier, + const VkFence fence_handle) { + api::Context* const context = api::context(); + + TORCH_CHECK( + v_src.buffer_bytes() == dst_buffer.mem_size(), + "Vulkan copy_vtensor_to_buffer: source texture and destination buffer " + "do not have the same number of bytes"); + + context->submit_copy( + // pipeline barrier + pipeline_barrier, + // resources + v_src.image( + pipeline_barrier, + api::PipelineStage::TRANSFER, + api::MemoryAccessType::READ), + dst_buffer, + // copy details + v_src.extents(), + {0u, 0u, 0u}, + {0u, 0u, 0u}, + // fence handle + fence_handle); +} + void pack_buffer_to_vtensor( api::VulkanBuffer& buffer, vTensor& v_self, @@ -85,17 +235,21 @@ void pack_vtensor_to_staging( }, }; + bool is_quantized = v_self.is_quantized(); + + api::utils::uvec3 pack_extents = extents; + if (is_quantized) { + pack_extents.data[0u] = 1; + pack_extents.data[1u] = 1; + pack_extents.data[2u] = + api::utils::safe_downcast(v_self.numtexels()); + } + api::UniformParamsBuffer params(context, block); api::PipelineBarrier pipeline_barrier{}; - bool is_quantized = v_self.is_quantized(); - api::utils::uvec3 copy_extents; - copy_extents.data[0u] = 1; - copy_extents.data[1u] = 1; - copy_extents.data[2u] = - ((v_self.sizes()[1] * v_self.sizes()[2] * v_self.sizes()[3]) / 4); + api::ShaderSource kernel = is_quantized ? VK_KERNEL(image_to_nchw_quantized) : VK_KERNEL(image_to_nchw); - api::utils::uvec3 extents_to_use = is_quantized ? 
copy_extents : extents; context->submit_compute_job( // shader descriptor @@ -103,9 +257,9 @@ void pack_vtensor_to_staging( // pipeline barrier pipeline_barrier, // global work group size - extents_to_use, + pack_extents, // local work group size - adaptive_work_group_size(extents_to_use), + adaptive_work_group_size(pack_extents), // fence handle fence_handle, // shader arguments diff --git a/aten/src/ATen/native/vulkan/ops/Utils.h b/aten/src/ATen/native/vulkan/ops/Utils.h index 59358ee173eb0..f9b85521bab0a 100644 --- a/aten/src/ATen/native/vulkan/ops/Utils.h +++ b/aten/src/ATen/native/vulkan/ops/Utils.h @@ -10,6 +10,23 @@ namespace vulkan { namespace ops { namespace utils { +Tensor nchw_to_nc4hw(const Tensor&); + +Tensor create_staging_tensor(const vTensor&); + +Tensor nc4hw_to_nchw(const Tensor&, IntArrayRef); + +void copy_buffer_to_vtensor( + api::VulkanBuffer&, + vTensor&, + api::PipelineBarrier&); + +void copy_vtensor_to_buffer( + vTensor&, + api::VulkanBuffer&, + api::PipelineBarrier&, + const VkFence fence_handle = VK_NULL_HANDLE); + inline int64_t normalize(const int64_t dimension, const int64_t n) { return (dimension % n + n) % n; } diff --git a/aten/src/ATen/native/vulkan/ops/VulkanOpContext.cpp b/aten/src/ATen/native/vulkan/ops/VulkanOpContext.cpp deleted file mode 100644 index 58f07b0d43c4f..0000000000000 --- a/aten/src/ATen/native/vulkan/ops/VulkanOpContext.cpp +++ /dev/null @@ -1,34 +0,0 @@ -#include - -namespace at { -namespace native { -namespace vulkan { -namespace ops { - -VulkanOpContext::VulkanOpContext( - c10::impl::GenericList packed_context, - c10::impl::GenericList unpacked_context) - : packed_(packed_context), unpacked_(unpacked_context) {} - -VulkanOpContext VulkanOpContext::create( - c10::impl::GenericList packed_context, - c10::impl::GenericList unpacked_context) { - return VulkanOpContext{packed_context, unpacked_context}; -} - -VulkanOpContext::State VulkanOpContext::get_state() const { - return VulkanOpContext::State{packed_, unpacked_}; -} - -const c10::impl::GenericList& VulkanOpContext::get_packed() const { - return packed_; -} - -const c10::impl::GenericList& VulkanOpContext::get_unpacked() const { - return unpacked_; -} - -} // namespace ops -} // namespace vulkan -} // namespace native -} // namespace at diff --git a/aten/src/ATen/native/vulkan/ops/VulkanOpContext.h b/aten/src/ATen/native/vulkan/ops/VulkanOpContext.h deleted file mode 100644 index 8907b486d50ca..0000000000000 --- a/aten/src/ATen/native/vulkan/ops/VulkanOpContext.h +++ /dev/null @@ -1,35 +0,0 @@ -#pragma once - -#ifdef USE_VULKAN_API - -#include - -namespace at { -namespace native { -namespace vulkan { -namespace ops { - -class VulkanOpContext final : public torch::jit::CustomClassHolder { - public: - static VulkanOpContext create( - c10::impl::GenericList packed_context, - c10::impl::GenericList unpacked_context); - using State = std::tuple; - State get_state() const; - const c10::impl::GenericList& get_packed() const; - const c10::impl::GenericList& get_unpacked() const; - - private: - VulkanOpContext( - c10::impl::GenericList packed_context, - c10::impl::GenericList unpacked_context); - c10::impl::GenericList packed_; - c10::impl::GenericList unpacked_; -}; - -} // namespace ops -} // namespace vulkan -} // namespace native -} // namespace at - -#endif /* USE_VULKAN_API */ diff --git a/aten/src/ATen/native/vulkan/ops/VulkanPackedContext.h b/aten/src/ATen/native/vulkan/ops/VulkanPackedContext.h new file mode 100644 index 0000000000000..f137bf5d5e785 --- /dev/null +++ 
b/aten/src/ATen/native/vulkan/ops/VulkanPackedContext.h @@ -0,0 +1,33 @@ +#pragma once + +#ifdef USE_VULKAN_API + +#include + +namespace at { +namespace native { +namespace vulkan { +namespace ops { + +class VulkanPackedContext { + protected: + c10::impl::GenericList packed_; + + public: + VulkanPackedContext() : packed_{c10::AnyType::get()} {} + + inline const c10::IValue get_val(int64_t i) const { + return packed_.get(i); + } + + virtual const c10::impl::GenericList unpack() const = 0; + + virtual ~VulkanPackedContext() = default; +}; + +} // namespace ops +} // namespace vulkan +} // namespace native +} // namespace at + +#endif /* USE_VULKAN_API */ diff --git a/aten/src/ATen/native/vulkan/ops/cumsum.cpp b/aten/src/ATen/native/vulkan/ops/cumsum.cpp index fd84d3304f396..679201532c21e 100644 --- a/aten/src/ATen/native/vulkan/ops/cumsum.cpp +++ b/aten/src/ATen/native/vulkan/ops/cumsum.cpp @@ -18,7 +18,8 @@ Tensor cumsum( input_arg.dim() <= 4, "Vulkan cumsum expects input dimension <= 4!"); TORCH_CHECK( - batch_size(input_arg) == 1, "Vulkan cumsum expects batch size <= 1!"); + get_dim(input_arg) == 1, + "Vulkan cumsum expects batch size <= 1!"); TORCH_CHECK(dim < 4, "Vulkan cumsum expects dim < 4!"); diff --git a/aten/src/ATen/templates/DispatchKeyFunctions_inl.h b/aten/src/ATen/templates/DispatchKeyFunctions_inl.h index 73bc1008a4f54..fbb71c2cb123c 100644 --- a/aten/src/ATen/templates/DispatchKeyFunctions_inl.h +++ b/aten/src/ATen/templates/DispatchKeyFunctions_inl.h @@ -18,10 +18,5 @@ ${DispatchKeyFunctions_inl_includes} -namespace at { -namespace ${dispatch_namespace} { ${dispatch_namespaced_declarations} - -} // namespace ${dispatch_namespace} -} // namespace at diff --git a/aten/src/ATen/templates/RegisterDispatchDefinitions.ini b/aten/src/ATen/templates/RegisterDispatchDefinitions.ini new file mode 100644 index 0000000000000..3bf7f9b1bb321 --- /dev/null +++ b/aten/src/ATen/templates/RegisterDispatchDefinitions.ini @@ -0,0 +1,24 @@ +${ns_prologue} + +// NB: TORCH_LIBRARY_IMPL must be in an anonymous namespace to avoid +// ambiguity with conflicting identifiers that may have been defined in +// at namespace already. +namespace { + +${dispatch_helpers} + +${dispatch_anonymous_definitions} + +${static_init_dispatch_registrations} + +} // anonymous namespace + +${deferred_dispatch_registrations} + +namespace ${dispatch_namespace} { + +${dispatch_namespaced_definitions} + +} // namespace ${dispatch_namespace} + +${ns_epilogue} diff --git a/aten/src/ATen/templates/RegisterDispatchKey.cpp b/aten/src/ATen/templates/RegisterDispatchKey.cpp index df00c0d0e4a32..7a1584d505f5a 100644 --- a/aten/src/ATen/templates/RegisterDispatchKey.cpp +++ b/aten/src/ATen/templates/RegisterDispatchKey.cpp @@ -50,28 +50,5 @@ $dispatch_headers $ops_headers - -namespace at { - -// NB: TORCH_LIBRARY_IMPL must be in an anonymous namespace to avoid -// ambiguity with conflicting identifiers that may have been defined in -// at namespace already. 
-namespace { - -${dispatch_helpers} - -${dispatch_anonymous_definitions} - -${static_init_dispatch_registrations} - -} // anonymous namespace - -${deferred_dispatch_registrations} - -namespace ${dispatch_namespace} { - -${dispatch_namespaced_definitions} - -} // namespace ${dispatch_namespace} - -} // namespace at +// See template file RegisterDispatchDefinitions.ini +$dispatch_definitions diff --git a/aten/src/ATen/test/cpu_generator_test.cpp b/aten/src/ATen/test/cpu_generator_test.cpp index db392b6ead260..6cf3431c66c0e 100644 --- a/aten/src/ATen/test/cpu_generator_test.cpp +++ b/aten/src/ATen/test/cpu_generator_test.cpp @@ -144,8 +144,8 @@ TEST(CPUGeneratorImpl, TestPhiloxEngineReproducibility) { // launch on same thread index and create two engines. // Given same seed, idx and offset, assert that the engines // should be aligned and have the same sequence. - at::Philox4_32_10 engine1(0, 0, 4); - at::Philox4_32_10 engine2(0, 0, 4); + at::Philox4_32 engine1(0, 0, 4); + at::Philox4_32 engine2(0, 0, 4); ASSERT_EQ(engine1(), engine2()); } @@ -156,11 +156,11 @@ TEST(CPUGeneratorImpl, TestPhiloxEngineOffset1) { // make another engine increment to until the // first 8 values. Assert that the first call // of engine2 and the 9th call of engine1 are equal. - at::Philox4_32_10 engine1(123, 1, 0); + at::Philox4_32 engine1(123, 1, 0); // Note: offset is a multiple of 4. // So if you want to skip 8 values, offset would // be 2, since 2*4=8. - at::Philox4_32_10 engine2(123, 1, 2); + at::Philox4_32 engine2(123, 1, 2); for (const auto i : c10::irange(8)) { (void)i; // Suppress unused variable warning // Note: instead of using the engine() call 8 times @@ -179,8 +179,8 @@ TEST(CPUGeneratorImpl, TestPhiloxEngineOffset2) { // make engine2 skip to the 2^64th 128 bit while being at 2^64th thread // Assert that engine2 should be increment_val+1 steps behind engine1. unsigned long long increment_val = std::numeric_limits::max(); - at::Philox4_32_10 engine1(123, 0, increment_val); - at::Philox4_32_10 engine2(123, increment_val, increment_val); + at::Philox4_32 engine1(123, 0, increment_val); + at::Philox4_32 engine2(123, increment_val, increment_val); engine2.incr_n(increment_val); engine2.incr(); @@ -195,8 +195,8 @@ TEST(CPUGeneratorImpl, TestPhiloxEngineOffset3) { // start engine2 at thread 1, with offset 0 // Assert that engine1 is 1 step behind engine2. unsigned long long increment_val = std::numeric_limits::max(); - at::Philox4_32_10 engine1(123, 0, increment_val); - at::Philox4_32_10 engine2(123, 1, 0); + at::Philox4_32 engine1(123, 0, increment_val); + at::Philox4_32 engine2(123, 1, 0); engine1.incr(); ASSERT_EQ(engine1(), engine2()); } @@ -206,8 +206,8 @@ TEST(CPUGeneratorImpl, TestPhiloxEngineIndex) { // Tests if thread indexing is working properly. // create two engines with different thread index but same offset. // Assert that the engines have different sequences. - at::Philox4_32_10 engine1(123456, 0, 4); - at::Philox4_32_10 engine2(123456, 1, 4); + at::Philox4_32 engine1(123456, 0, 4); + at::Philox4_32 engine2(123456, 1, 4); ASSERT_NE(engine1(), engine2()); } @@ -247,3 +247,19 @@ TEST(CPUGeneratorImpl, TestMT19937EngineReproducibility) { } } + +TEST(CPUGeneratorImpl, TestPhiloxEngineReproducibilityRandN) { + at::Philox4_32 engine1(0, 0, 4); + at::Philox4_32 engine2(0, 0, 4); + ASSERT_EQ(engine1.randn(1), engine2.randn(1)); +} + +TEST(CPUGeneratorImpl, TestPhiloxDeterministic) { + at::Philox4_32 engine1(0, 0, 4); + ASSERT_EQ(engine1(), 4013802324); // Determinism! 
+ ASSERT_EQ(engine1(), 2979262830); // Determinism! + + at::Philox4_32 engine2(10, 0, 1); + ASSERT_EQ(engine2(), 2007330488); // Determinism! + ASSERT_EQ(engine2(), 2354548925); // Determinism! +} diff --git a/aten/src/ATen/test/cuda_generator_test.cu b/aten/src/ATen/test/cuda_generator_test.cu index 1ea5c2ebb0077..f82db6de1d5b8 100644 --- a/aten/src/ATen/test/cuda_generator_test.cu +++ b/aten/src/ATen/test/cuda_generator_test.cu @@ -21,8 +21,8 @@ using namespace at; __global__ void testEngineReproducibility(){ int idx = blockIdx.x * blockDim.x + threadIdx.x; - at::Philox4_32_10 engine1(0, idx, 4); - at::Philox4_32_10 engine2(0, idx, 4); + at::Philox4_32 engine1(0, idx, 4); + at::Philox4_32 engine2(0, idx, 4); assert(engine1() == engine2()); } @@ -45,11 +45,11 @@ TEST(CUDAGeneratorImpl, TestPhiloxEngineReproducibility) { } __global__ void testEngineOffset1(){ - at::Philox4_32_10 engine1(123, 1, 0); + at::Philox4_32 engine1(123, 1, 0); // Note: offset is a multiple of 4. // So if you want to skip 8 values, offset would // be 2, since 2*4=8. - at::Philox4_32_10 engine2(123, 1, 2); + at::Philox4_32 engine2(123, 1, 2); for(int i = 0; i < 8; i++){ // Note: instead of using the engine() call 8 times // we could have achieved the same functionality by @@ -81,8 +81,8 @@ TEST(CUDAGeneratorImpl, TestPhiloxEngineOffset1) { __global__ void testEngineOffset2(){ unsigned long long increment_val = ::ldexp(1.0, 64); - at::Philox4_32_10 engine1(123, 0, increment_val); - at::Philox4_32_10 engine2(123, increment_val, increment_val); + at::Philox4_32 engine1(123, 0, increment_val); + at::Philox4_32 engine2(123, increment_val, increment_val); engine2.incr_n(increment_val); engine2.incr(); @@ -110,8 +110,8 @@ TEST(CUDAGeneratorImpl, TestPhiloxEngineOffset2) { __global__ void testEngineOffset3(){ unsigned long long increment_val = ::ldexp(1.0, 64); - at::Philox4_32_10 engine1(123, 0, increment_val); - at::Philox4_32_10 engine2(123, 1, 0); + at::Philox4_32 engine1(123, 0, increment_val); + at::Philox4_32 engine2(123, 1, 0); engine1.incr(); assert(engine1() == engine2()); } @@ -136,8 +136,8 @@ TEST(CUDAGeneratorImpl, TestPhiloxEngineOffset3) { } __global__ void testEngineThreadIndex(){ - at::Philox4_32_10 engine1(123456, 0, 4); - at::Philox4_32_10 engine2(123456, 1, 4); + at::Philox4_32 engine1(123456, 0, 4); + at::Philox4_32 engine2(123456, 1, 4); assert(engine1() != engine2()); } diff --git a/aten/src/ATen/test/vulkan_api_test.cpp b/aten/src/ATen/test/vulkan_api_test.cpp index 7276261738593..70abc0aa59281 100644 --- a/aten/src/ATen/test/vulkan_api_test.cpp +++ b/aten/src/ATen/test/vulkan_api_test.cpp @@ -4,6 +4,7 @@ #include #include #include +#include #include // TODO: These functions should move to a common place. 
@@ -248,6 +249,31 @@ class VulkanAPITest : public ::testing::Test { } }; +TEST_F(VulkanAPITest, copy_to_texture) { + at::Tensor test_tensors[] = { + // 4D + at::rand({7, 17, 134, 213}, at::TensorOptions(at::kCPU).dtype(at::kFloat)), + // 3D + at::rand({67, 134, 213}, at::TensorOptions(at::kCPU).dtype(at::kFloat)), + // 2D + at::rand({229, 213}, at::TensorOptions(at::kCPU).dtype(at::kFloat)), + // 1D + at::rand({1902}, at::TensorOptions(at::kCPU).dtype(at::kFloat)), + }; + + for (auto in_cpu : test_tensors) { + at::Tensor in_vk_copied = in_cpu.vulkan(); + at::Tensor out_copied = in_vk_copied.cpu(); + + const auto check_copy = almostEqual(out_copied, in_cpu); + + if(!check_copy) { + std::cout << "Copy failed on size " << in_cpu.sizes() + << "with dtype" << in_cpu.dtype() << std::endl; + } + } +} + TEST_F(VulkanAPITest, adaptive_avg_pool2d) { c10::InferenceMode mode; @@ -336,6 +362,43 @@ TEST_F(VulkanAPITest, add_broadcast2) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, add_broadcast3) { + + const auto a_cpu = at::rand({3, 4, 41, 53}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 1, 41, 53}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::add(a_cpu, b_cpu, 2.5f); + const auto c_vulkan = at::add(a_vulkan, b_vulkan, 2.5f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, add_broadcast4) { + const auto a_cpu = at::rand({3, 4, 41, 1}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 41, 53}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::add(a_cpu, b_cpu, 2.5f); + const auto c_vulkan = at::add(a_vulkan, b_vulkan, 2.5f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, add_) { auto a_cpu = at::rand({61, 17, 29, 83}, at::device(at::kCPU).dtype(at::kFloat)); auto a_vulkan = a_cpu.vulkan(); @@ -424,6 +487,69 @@ TEST_F(VulkanAPITest, add_scalar_) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, add_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a_cpu = at::rand({13, 23, 59, 73}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto c_cpu = at::add(a_cpu, b_scalar, 2.1f); + const auto c_vulkan = at::add(a_vulkan, b_scalar, 2.1f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, add_scalar_wrapped_) { + if (!at::is_vulkan_available()) { + return; + } + + auto a_cpu = at::rand({47, 2, 23, 97}, at::device(at::kCPU).dtype(at::kFloat)); + auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + a_cpu.add_(b_scalar, 2.1f); + a_vulkan.add_(b_scalar, 2.1f); + + const auto check = almostEqual(a_cpu, a_vulkan.cpu()); + if (!check) { + showRtol(a_cpu, a_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, add_to_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto b_cpu = 
at::rand({11, 7, 139, 109}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::add(a, b_cpu, 2.1f); + const auto c_vulkan = at::add(a, b_vulkan, 2.1f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, addmm) { constexpr float alpha = 2.1f; constexpr float beta = 103.24; @@ -1033,6 +1159,42 @@ TEST_F(VulkanAPITest, div_broadcast2) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, div_broadcast3) { + const auto a_cpu = at::rand({3, 4, 179, 221}, at::device(at::kCPU).dtype(at::kFloat))+0.01; + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 1, 179, 221}, at::device(at::kCPU).dtype(at::kFloat))+0.01; + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::div(a_cpu, b_cpu); + const auto c_vulkan = at::div(a_vulkan, b_vulkan); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, div_broadcast4) { + const auto a_cpu = at::rand({3, 4, 41, 1}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 41, 53}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::div(a_cpu, b_cpu); + const auto c_vulkan = at::div(a_vulkan, b_vulkan); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, div_) { auto a_cpu = at::rand({61, 17, 29, 83}, at::device(at::kCPU).dtype(at::kFloat))+0.01; auto a_vulkan = a_cpu.vulkan(); @@ -1122,6 +1284,69 @@ TEST_F(VulkanAPITest, div_scalar_) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, div_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a_cpu = at::rand({17, 213, 213, 7}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto c_cpu = at::div(a_cpu, b_scalar); + const auto c_vulkan = at::div(a_vulkan, b_scalar); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, div_scalar_wrapped_) { + if (!at::is_vulkan_available()) { + return; + } + + auto a_cpu = at::rand({11, 7, 139, 109}, at::device(at::kCPU).dtype(at::kFloat)); + auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + a_cpu.div_(b_scalar); + a_vulkan.div_(b_scalar); + + const auto check = almostEqual(a_cpu, a_vulkan.cpu()); + if (!check) { + showRtol(a_cpu, a_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, div_to_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto b_cpu = at::rand({2, 3, 5, 7}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::div(a, b_cpu); + const auto c_vulkan = at::div(a, b_vulkan); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, empty) { ASSERT_NO_THROW(at::empty({1, 17, 41, 53}, at::device(at::kVulkan).dtype(at::kFloat))); @@ -1816,6 
+2041,42 @@ TEST_F(VulkanAPITest, mul_broadcast2) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, mul_broadcast3) { + const auto a_cpu = at::rand({3, 4, 179, 221}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 1, 179, 221}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::mul(a_cpu, b_cpu); + const auto c_vulkan = at::mul(a_vulkan, b_vulkan); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, mul_broadcast4) { + const auto a_cpu = at::rand({3, 4, 179, 1}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 179, 221}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::mul(a_cpu, b_cpu); + const auto c_vulkan = at::mul(a_vulkan, b_vulkan); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, mul_) { auto a_cpu = at::rand({61, 17, 29, 83}, at::device(at::kCPU).dtype(at::kFloat)); auto a_vulkan = a_cpu.vulkan(); @@ -1904,6 +2165,69 @@ TEST_F(VulkanAPITest, mul_scalar_) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, mul_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a_cpu = at::rand({17, 213, 213, 7}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto c_cpu = at::mul(a_cpu, b_scalar); + const auto c_vulkan = at::mul(a_vulkan, b_scalar); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, mul_scalar_wrapped_) { + if (!at::is_vulkan_available()) { + return; + } + + auto a_cpu = at::rand({11, 7, 139, 109}, at::device(at::kCPU).dtype(at::kFloat)); + auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + a_cpu.mul_(b_scalar); + a_vulkan.mul_(b_scalar); + + const auto check = almostEqual(a_cpu, a_vulkan.cpu()); + if (!check) { + showRtol(a_cpu, a_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, mul_to_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto b_cpu = at::rand({11, 7, 139, 109}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::mul(a, b_cpu); + const auto c_vulkan = at::mul(a, b_vulkan); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, reflection_pad2d) { const auto a_cpu = at::rand({2, 3, 47, 63}, at::device(at::kCPU).dtype(at::kFloat)); const auto a_vulkan = a_cpu.vulkan(); @@ -2182,6 +2506,42 @@ TEST_F(VulkanAPITest, sub_broadcast2) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, sub_broadcast3) { + const auto a_cpu = at::rand({3, 4, 179, 221}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 1, 179, 221}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::sub(a_cpu, 
b_cpu, 2.5f); + const auto c_vulkan = at::sub(a_vulkan, b_vulkan, 2.5f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, sub_broadcast4) { + const auto a_cpu = at::rand({3, 4, 179, 1}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_cpu = at::rand({1, 179, 221}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::sub(a_cpu, b_cpu, 2.5f); + const auto c_vulkan = at::sub(a_vulkan, b_vulkan, 2.5f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, sub_) { auto a_cpu = at::rand({61, 17, 29, 83}, at::device(at::kCPU).dtype(at::kFloat)); auto a_vulkan = a_cpu.vulkan(); @@ -2236,6 +2596,111 @@ TEST_F(VulkanAPITest, sub_broadcast1_) { ASSERT_TRUE(check); } +TEST_F(VulkanAPITest, sub_scalar) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a_cpu = at::rand({13, 23, 59, 73}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const float b_scalar = 3.1415f; + + const auto c_cpu = at::sub(a_cpu, b_scalar, 2.1f); + const auto c_vulkan = at::sub(a_vulkan, b_scalar, 2.1f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, sub_scalar_) { + if (!at::is_vulkan_available()) { + return; + } + + auto a_cpu = at::rand({47, 2, 23, 97}, at::device(at::kCPU).dtype(at::kFloat)); + auto a_vulkan = a_cpu.vulkan(); + + const float b_scalar = 3.1415f; + + a_cpu.sub_(b_scalar, 2.1f); + a_vulkan.sub_(b_scalar, 2.1f); + + const auto check = almostEqual(a_cpu, a_vulkan.cpu()); + if (!check) { + showRtol(a_cpu, a_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, sub_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a_cpu = at::rand({13, 23, 59, 73}, at::device(at::kCPU).dtype(at::kFloat)); + const auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto c_cpu = at::sub(a_cpu, b_scalar, 2.1f); + const auto c_vulkan = at::sub(a_vulkan, b_scalar, 2.1f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, sub_scalar_wrapped_) { + if (!at::is_vulkan_available()) { + return; + } + + auto a_cpu = at::rand({47, 2, 23, 97}, at::device(at::kCPU).dtype(at::kFloat)); + auto a_vulkan = a_cpu.vulkan(); + + const auto b_scalar = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + a_cpu.sub_(b_scalar, 2.1f); + a_vulkan.sub_(b_scalar, 2.1f); + + const auto check = almostEqual(a_cpu, a_vulkan.cpu()); + if (!check) { + showRtol(a_cpu, a_vulkan.cpu()); + } + + ASSERT_TRUE(check); +} + +TEST_F(VulkanAPITest, sub_to_scalar_wrapped) { + if (!at::is_vulkan_available()) { + return; + } + + const auto a = at::rand({1}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto b_cpu = at::rand({11, 7, 139, 109}, at::device(at::kCPU).dtype(at::kFloat)); + const auto b_vulkan = b_cpu.vulkan(); + + const auto c_cpu = at::sub(a, b_cpu, 2.1f); + const auto c_vulkan = at::sub(a, b_vulkan, 2.1f); + + const auto check = almostEqual(c_cpu, c_vulkan.cpu()); + if (!check) { + showRtol(c_cpu, c_vulkan.cpu()); + } + + 
ASSERT_TRUE(check); +} + TEST_F(VulkanAPITest, transposed_conv2d) { // Arrange constexpr int64_t groups = 1; @@ -3368,13 +3833,15 @@ TEST_F(VulkanAPITest, gru_success) { const int H_in = 5; // input_size const int H_out = 7; // hidden_size const int num_layers = 3; + const int L = 1; + const int N = 1; const double gru_dropout = .0; const bool has_biases = true; const bool train = false; const bool bidirectional = false; const bool batch_first = true; - const auto in_cpu = at::rand({1, 1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); - const auto h0_cpu = at::rand({num_layers, 1, H_out}, at::device(at::kCPU).dtype(at::kFloat)); + const auto in_cpu = at::rand({N, L, H_in}, at::device(at::kCPU).dtype(at::kFloat)); + const auto h0_cpu = at::rand({num_layers, N, H_out}, at::device(at::kCPU).dtype(at::kFloat)); c10::List weight_ih_l; // shape (3 * hidden_size, input_size) c10::List weight_hh_l; // shape (3 * hidden_size, hidden_size) @@ -3435,13 +3902,15 @@ TEST_F(VulkanAPITest, gru_mclareninputs_success) { const int H_in = 384; // input_size const int H_out = 384; // hidden_size const int num_layers = 2; + const int L = 1; + const int N = 1; const double gru_dropout = .0; const bool has_biases = true; const bool train = false; const bool bidirectional = false; const bool batch_first = true; - const auto in_cpu = at::rand({1, 1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); - const auto h0_cpu = at::rand({num_layers, 1, H_out}, at::device(at::kCPU).dtype(at::kFloat)); + const auto in_cpu = at::rand({N, L, H_in}, at::device(at::kCPU).dtype(at::kFloat)); + const auto h0_cpu = at::rand({num_layers, N, H_out}, at::device(at::kCPU).dtype(at::kFloat)); c10::List weight_ih_l; // shape (3 * hidden_size, input_size) c10::List weight_hh_l; // shape (3 * hidden_size, hidden_size) @@ -3498,13 +3967,15 @@ TEST_F(VulkanAPITest, gru_invalidinputs_exceptions) { const int H_in = 17; // input_size const int H_out = 50; // hidden_size const int num_layers = 2; + const int L = 5; + const int N = 4; const double gru_dropout = .0; const bool has_biases = true; const bool train = false; const bool bidirectional = false; const bool batch_first = true; - const auto in_cpu = at::rand({1, 1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); - const auto h0_cpu = at::rand({num_layers, 1, H_out}, at::device(at::kCPU).dtype(at::kFloat)); + const auto in_cpu = at::rand({N, L, H_in}, at::device(at::kCPU).dtype(at::kFloat)); + const auto h0_cpu = at::rand({num_layers, N, H_out}, at::device(at::kCPU).dtype(at::kFloat)); c10::List weight_ih_l; // shape (3 * hidden_size, input_size) c10::List weight_hh_l; // shape (3 * hidden_size, hidden_size) @@ -3591,13 +4062,15 @@ TEST_F(VulkanAPITest, gru_prepack_success) { const int H_in = 81; // input_size const int H_out = 10; // hidden_size const int num_layers = 2; + const int L = 1; + const int N = 1; const double gru_dropout = .0; const bool has_biases = true; const bool train = false; const bool bidirectional = false; const bool batch_first = true; - const auto in_cpu = at::rand({1, 1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); - const auto h0_cpu = at::rand({num_layers, 1, H_out}, at::device(at::kCPU).dtype(at::kFloat)); + const auto in_cpu = at::rand({N, L, H_in}, at::device(at::kCPU).dtype(at::kFloat)); + const auto h0_cpu = at::rand({num_layers, N, H_out}, at::device(at::kCPU).dtype(at::kFloat)); c10::List weight_ih_l; // shape (3 * hidden_size, input_size) c10::List weight_hh_l; // shape (3 * hidden_size, hidden_size) @@ -3626,13 +4099,13 @@ TEST_F(VulkanAPITest, 
gru_prepack_success) { has_biases, num_layers, gru_dropout, train, bidirectional, batch_first); auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), has_biases, num_layers, gru_dropout, train, bidirectional, batch_first); auto out_vulkan = callOpByName( - "vulkan_prepack::gru_run", + "vulkan_prepack::run_gru_context", "", in_cpu.vulkan(), h0_cpu.vulkan(), prepack[0]); @@ -3660,13 +4133,15 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { const int H_in = 70; // input_size const int H_out = 2; // hidden_size const int num_layers = 2; + const int L = 3; + const int N = 5; const double gru_dropout = .0; const bool has_biases = true; const bool train = false; const bool bidirectional = false; const bool batch_first = true; - const auto in_cpu = at::rand({1, 1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); - const auto h0_cpu = at::rand({num_layers, 1, H_out}, at::device(at::kCPU).dtype(at::kFloat)); + const auto in_cpu = at::rand({N, L, H_in}, at::device(at::kCPU).dtype(at::kFloat)); + const auto h0_cpu = at::rand({num_layers, N, H_out}, at::device(at::kCPU).dtype(at::kFloat)); c10::List weight_ih_l; // shape (3 * hidden_size, input_size) c10::List weight_hh_l; // shape (3 * hidden_size, hidden_size) @@ -3692,7 +4167,7 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { // Act: incorrect # of weights/biases EXPECT_THROW({ auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1) }), @@ -3703,13 +4178,13 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { EXPECT_THROW({ const auto in_cpu_2d = at::rand({1, H_in}, at::device(at::kCPU).dtype(at::kFloat)); auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), has_biases, num_layers, gru_dropout, train, bidirectional, batch_first); auto out_vulkan = callOpByName( - "vulkan_prepack::gru_run", + "vulkan_prepack::run_gru_context", "", in_cpu_2d.vulkan(), h0_cpu.vulkan(), prepack[0]); }, ::c10::Error); @@ -3718,13 +4193,13 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { EXPECT_THROW({ const auto h0_cpu_2d = at::rand({num_layers, H_out}, at::device(at::kCPU).dtype(at::kFloat)); auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), has_biases, num_layers, gru_dropout, train, bidirectional, batch_first); auto out_vulkan = callOpByName( - "vulkan_prepack::gru_run", + "vulkan_prepack::run_gru_context", "", in_cpu.vulkan(), h0_cpu_2d.vulkan(), prepack[0]); }, ::c10::Error); @@ -3732,7 +4207,7 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { // Act: has_biases should be true EXPECT_THROW({ auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), 
weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), @@ -3742,7 +4217,7 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { // Act: train should be false EXPECT_THROW({ auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), @@ -3752,7 +4227,7 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { // Act: bidirectional should be false EXPECT_THROW({ auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), @@ -3762,17 +4237,21 @@ TEST_F(VulkanAPITest, gru_prepack_invalidinputs_exceptions) { // Act: batch_first should be true EXPECT_THROW({ auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), has_biases, num_layers, gru_dropout, train, bidirectional, false); + auto out_vulkan = callOpByName( + "vulkan_prepack::run_gru_context", + "", + in_cpu.vulkan(), h0_cpu.vulkan(), prepack[0]); }, ::c10::Error); // Act: dropout should be 0.0 EXPECT_THROW({ auto prepack = callOpByName( - "vulkan_prepack::gru_prepack", + "vulkan_prepack::create_gru_context", "", std::vector({ weight_ih_l.get(0), weight_hh_l.get(0), bias_ih_l.get(0), bias_hh_l.get(0), weight_ih_l.get(1), weight_hh_l.get(1), bias_ih_l.get(1), bias_hh_l.get(1) }), diff --git a/aten/src/ATen/test/vulkan_quantized_api_test.cpp b/aten/src/ATen/test/vulkan_quantized_api_test.cpp index 9519b079d35e8..3b86b472fffdf 100644 --- a/aten/src/ATen/test/vulkan_quantized_api_test.cpp +++ b/aten/src/ATen/test/vulkan_quantized_api_test.cpp @@ -34,9 +34,11 @@ bool almostEqual(const at::Tensor& a, const at::Tensor& b) { return checkRtol(a - b, {a, b}); } +/* Unused function bool exactlyEqual(const at::Tensor& a, const at::Tensor& b) { return (a - b).abs().max().item() == 0.0f; } +*/ void showRtol(const at::Tensor& a, const at::Tensor& b) { const auto diff = (a - b).abs(); diff --git a/aten/src/ATen/test/xnnpack_test.cpp b/aten/src/ATen/test/xnnpack_test.cpp index 27e87545aa280..a936273bbcecc 100644 --- a/aten/src/ATen/test/xnnpack_test.cpp +++ b/aten/src/ATen/test/xnnpack_test.cpp @@ -3,11 +3,11 @@ #include #include -#include -#include -#include #include +#include #include +#include +#include #if defined(C10_MOBILE) && defined(USE_XNNPACK) @@ -31,7 +31,8 @@ void test_hardswish(const at::Tensor& input, const at::Tensor& expected) { auto result = at::native::xnnpack::hardswish(input); auto check = almostEqual(expected, result); ASSERT_TRUE(check); - ASSERT_TRUE(expected.suggest_memory_format() == input.suggest_memory_format()); + ASSERT_TRUE( + expected.suggest_memory_format() == input.suggest_memory_format()); } void test_hardswish_(at::Tensor input, const at::Tensor& expected) { @@ -39,7 +40,8 @@ void test_hardswish_(at::Tensor input, const at::Tensor& expected) { at::native::xnnpack::hardswish_(input); auto check = almostEqual(expected, input); ASSERT_TRUE(check); - ASSERT_TRUE(expected.suggest_memory_format() 
== input.suggest_memory_format()); + ASSERT_TRUE( + expected.suggest_memory_format() == input.suggest_memory_format()); } void test_global_average_pool(at::Tensor input, const at::Tensor& expected) { @@ -49,58 +51,133 @@ void test_global_average_pool(at::Tensor input, const at::Tensor& expected) { ASSERT_TRUE(check); } -// Since XNNPACK path is only taken #if defined(C10_MOBILE) && defined(USE_XNNPACK) -// We can't compare regular CPU path with XNNPACK path in the same test binary -// Instead we precompute regular results and compare with XNNPACK path here +// Since XNNPACK path is only taken #if defined(C10_MOBILE) && +// defined(USE_XNNPACK) We can't compare regular CPU path with XNNPACK path in +// the same test binary Instead we precompute regular results and compare with +// XNNPACK path here +TEST(TestXNNPackOps, TestLinear) { + constexpr std::array input_shape{1, 37}; + constexpr std::array weight_shape{41, 37}; + constexpr std::array bias_shape{1, 41}; + const auto input_cpu = + at::rand(input_shape, at::device(at::kCPU).dtype(at::kFloat)); + const auto weight = + at::rand(weight_shape, at::device(at::kCPU).dtype(at::kFloat)); + const auto bias = + at::rand(bias_shape, at::device(at::kCPU).dtype(at::kFloat)); + + const auto out_cpu = at::linear(input_cpu, weight, bias); + + const auto xnnpack_bias = bias.view({41}); + ASSERT_TRUE(at::native::xnnpack::use_linear(input_cpu, weight, xnnpack_bias)); + const auto result = + at::native::xnnpack::linear(input_cpu, weight, xnnpack_bias); + + auto check = almostEqual(out_cpu, result); + ASSERT_TRUE(check); +} + +TEST(TestXNNPackOps, TestMaxPool2d) { + const auto in_cpu = + at::rand({5, 13, 55, 68}, at::TensorOptions(at::kCPU).dtype(at::kFloat)); + const auto out_cpu = + at::max_pool2d(in_cpu, {3, 4}, {2, 1}, {1, 1}, {1, 1}, false); + ASSERT_TRUE(at::native::xnnpack::use_max_pool2d( + in_cpu, {3, 4}, {1, 1}, {2, 1}, {1, 1}, false)); + const auto result = at::native::xnnpack::max_pool2d( + in_cpu, {3, 4}, {1, 1}, {2, 1}, {1, 1}, false); + + auto check = almostEqual(out_cpu, result); + ASSERT_TRUE(check); +} + +TEST(TestXNNPackOps, TestConvolution2d) { + constexpr int64_t groups = 1; + constexpr std::array stride{2, 2}; + constexpr std::array padding{1, 1}; + constexpr std::array dilation{1, 1}; + + constexpr struct { + uint32_t batches; + uint32_t channels; + uint32_t width; + uint32_t height; + + std::array size() const { + return { + batches, + channels, + width, + height, + }; + } + } input{1, 3, 8, 8}; + + constexpr struct { + uint32_t output_channels; + uint32_t input_channels; + uint32_t width; + uint32_t height; + + std::array size() const { + return { + output_channels, + input_channels, + width, + height, + }; + } + } weights{1, input.channels, 3, 3}; + + const auto input_cpu = + at::randn(input.size(), at::device(at::kCPU).dtype(at::kFloat)); + const auto weights_cpu = + at::randn(weights.size(), at::device(at::kCPU).dtype(at::kFloat)); + const auto bias_cpu = at::randn( + {weights.output_channels}, at::device(at::kCPU).dtype(at::kFloat)); + + const auto output_cpu = at::conv2d( + input_cpu, weights_cpu, bias_cpu, stride, padding, dilation, groups); + + ASSERT_TRUE(at::native::xnnpack::use_convolution2d( + input_cpu, + weights_cpu, + weights.output_channels, + padding, + stride, + dilation, + groups, + false)); + const auto result = at::native::xnnpack::convolution2d( + input_cpu, weights_cpu, bias_cpu, padding, stride, dilation, groups); + auto check = almostEqual(output_cpu, result); + ASSERT_TRUE(check); +} + TEST(TestXNNPackOps, 
TestHardSwish) { // input, expected_result pair auto in = torch::tensor({{1, 1}, {1, 1}}, {torch::kFloat32}); auto in_slice = in.index({"...", 0}); std::vector> input_result_pairs = { - { - torch::tensor({1, 2, 3, 4, 5}, {torch::kFloat32}), - torch::tensor({0.6667, 1.6667, 3.0000, 4.0000, 5.0000}, {torch::kFloat32}) - }, - { - torch::tensor({0.3330}, {torch::kFloat32}), - torch::tensor({0.1850}, {torch::kFloat32}) - }, - { - torch::tensor({ - {0.4523, 0.8131, 0.9829}, - {0.0782, 0.7395, 0.0787} - }), - torch::tensor({ - {0.2602, 0.5167, 0.6525}, - {0.0401, 0.4609, 0.0404} - }) - }, - { - in_slice, - torch::tensor({0.6667, 0.6667}, {torch::kFloat32}) - }, - { - torch::tensor( - {{{{0.4993, 0.3835}, - {0.3163, 0.2348}}, - {{0.4705, 0.4129}, - {0.9314, 0.0631}}}, - {{{0.0030, 0.5656}, - {0.1413, 0.1943}}, - {{0.1380, 0.1985}, - {0.2746, 0.8109}}}}).contiguous(at::MemoryFormat::ChannelsLast), - torch::tensor( - {{{{0.2912, 0.2163}, - {0.1748, 0.1266}}, - {{0.2722, 0.2349}, - {0.6103, 0.0322}}}, - {{{0.0015, 0.3361}, - {0.0740, 0.1034}}, - {{0.0722, 0.1058}, - {0.1499, 0.5150}}}}).contiguous(at::MemoryFormat::ChannelsLast) - } - }; + {torch::tensor({1, 2, 3, 4, 5}, {torch::kFloat32}), + torch::tensor( + {0.6667, 1.6667, 3.0000, 4.0000, 5.0000}, {torch::kFloat32})}, + {torch::tensor({0.3330}, {torch::kFloat32}), + torch::tensor({0.1850}, {torch::kFloat32})}, + {torch::tensor({{0.4523, 0.8131, 0.9829}, {0.0782, 0.7395, 0.0787}}), + torch::tensor({{0.2602, 0.5167, 0.6525}, {0.0401, 0.4609, 0.0404}})}, + {in_slice, torch::tensor({0.6667, 0.6667}, {torch::kFloat32})}, + {torch::tensor({{{{0.4993, 0.3835}, {0.3163, 0.2348}}, + {{0.4705, 0.4129}, {0.9314, 0.0631}}}, + {{{0.0030, 0.5656}, {0.1413, 0.1943}}, + {{0.1380, 0.1985}, {0.2746, 0.8109}}}}) + .contiguous(at::MemoryFormat::ChannelsLast), + torch::tensor({{{{0.2912, 0.2163}, {0.1748, 0.1266}}, + {{0.2722, 0.2349}, {0.6103, 0.0322}}}, + {{{0.0015, 0.3361}, {0.0740, 0.1034}}, + {{0.0722, 0.1058}, {0.1499, 0.5150}}}}) + .contiguous(at::MemoryFormat::ChannelsLast)}}; for (const auto& input_result : input_result_pairs) { test_hardswish(input_result.first, input_result.second); @@ -111,42 +188,24 @@ TEST(TestXNNPackOps, TestHardSwish) { TEST(TestXNNPackOps, TestGlobal) { // input, expected_result pair std::vector> input_result_pairs = { - { - torch::tensor({{ - {{0.0852, 0.7312, 0.9943, 0.7105}, - {0.0956, 0.9072, 0.3124, 0.9362}, - {0.5878, 0.8883, 0.5086, 0.9494}}, - {{0.1056, 0.4968, 0.7740, 0.7593}, - {0.8519, 0.3543, 0.8078, 0.5517}, - {0.1413, 0.4608, 0.1706, 0.0314}} - }}, {torch::kFloat32}), - torch::tensor({{ - {{0.6422}}, - {{0.4588}} - }}, {torch::kFloat32}) - }, - { - torch::tensor({{ - {{0.0280, 0.9073}, - {0.2103, 0.5298}}, - {{0.5335, 0.9901}, - {0.2902, 0.2955}} - }, - { - {{0.2363, 0.7024}, - {0.7903, 0.8260}}, - {{0.3802, 0.5959}, - {0.5749, 0.8855}} - }}, {torch::kFloat32}), - torch::tensor( - {{{{0.4188}}, - {{0.5273}}}, - {{{0.6388}}, - {{0.6091}}}}, - {torch::kFloat32} - ) - } - }; + {torch::tensor( + {{{{0.0852, 0.7312, 0.9943, 0.7105}, + {0.0956, 0.9072, 0.3124, 0.9362}, + {0.5878, 0.8883, 0.5086, 0.9494}}, + {{0.1056, 0.4968, 0.7740, 0.7593}, + {0.8519, 0.3543, 0.8078, 0.5517}, + {0.1413, 0.4608, 0.1706, 0.0314}}}}, + {torch::kFloat32}), + torch::tensor({{{{0.6422}}, {{0.4588}}}}, {torch::kFloat32})}, + {torch::tensor( + {{{{0.0280, 0.9073}, {0.2103, 0.5298}}, + {{0.5335, 0.9901}, {0.2902, 0.2955}}}, + {{{0.2363, 0.7024}, {0.7903, 0.8260}}, + {{0.3802, 0.5959}, {0.5749, 0.8855}}}}, + {torch::kFloat32}), + torch::tensor( + 
{{{{0.4188}}, {{0.5273}}}, {{{0.6388}}, {{0.6091}}}}, + {torch::kFloat32})}}; for (const auto& input_result : input_result_pairs) { test_global_average_pool(input_result.first, input_result.second); diff --git a/benchmarks/cpp/nvfuser/CMakeLists.txt b/benchmarks/cpp/nvfuser/CMakeLists.txt index 5ada0fc30d4ed..ad9053bb3a3aa 100644 --- a/benchmarks/cpp/nvfuser/CMakeLists.txt +++ b/benchmarks/cpp/nvfuser/CMakeLists.txt @@ -20,13 +20,16 @@ if(USE_CUDA) softmax_backward.cpp scale_bias_relu.cpp transpose.cpp + matmul.cpp timm.cpp utils.cpp main.cpp) target_link_libraries(nvfuser_bench PRIVATE torch_library benchmark) if(NOT MSVC) - target_compile_options(nvfuser_bench PRIVATE -Wno-unused-variable -Wno-deprecated-copy -Werror) + target_compile_options_if_supported(nvfuser_bench -Werror) + target_compile_options_if_supported(nvfuser_bench -Wno-unused-variable) + target_compile_options_if_supported(nvfuser_bench -Wno-deprecated-copy) endif() endif() diff --git a/benchmarks/cpp/nvfuser/batch_norm_channels_first.cpp b/benchmarks/cpp/nvfuser/batch_norm_channels_first.cpp index 723d222516df4..2f839f0c8332a 100644 --- a/benchmarks/cpp/nvfuser/batch_norm_channels_first.cpp +++ b/benchmarks/cpp/nvfuser/batch_norm_channels_first.cpp @@ -73,10 +73,6 @@ static void NvFuserScheduler_BatchNorm( DataType dtype) { TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half); - const bool kTraining = true; - const float kMomentum = 0.1; - const float kEps = 1e-5; - std::vector input_shape{ benchmark_state.range(0), benchmark_state.range(1), diff --git a/benchmarks/cpp/nvfuser/batch_norm_channels_first_backward.cpp b/benchmarks/cpp/nvfuser/batch_norm_channels_first_backward.cpp index af2b4d145fc8f..62a4e99e21ef6 100644 --- a/benchmarks/cpp/nvfuser/batch_norm_channels_first_backward.cpp +++ b/benchmarks/cpp/nvfuser/batch_norm_channels_first_backward.cpp @@ -25,7 +25,6 @@ static void setupBatchNorm_BWD(Fusion* fusion, DataType dtype) { FusionGuard fg(fusion); const bool kTraining = true; - const float kMomentum = 0.1; const float kEps = 1e-5; // setup fusion @@ -85,9 +84,6 @@ static void NvFuserScheduler_BatchNorm_BWD( DataType dtype) { TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half); - const bool kTraining = true; - const float kEps = 1e-5; - std::vector input_shape{ benchmark_state.range(0), benchmark_state.range(1), diff --git a/benchmarks/cpp/nvfuser/batch_norm_channels_last.cpp b/benchmarks/cpp/nvfuser/batch_norm_channels_last.cpp index 14fde631aec0b..7b8972a0aad07 100644 --- a/benchmarks/cpp/nvfuser/batch_norm_channels_last.cpp +++ b/benchmarks/cpp/nvfuser/batch_norm_channels_last.cpp @@ -74,10 +74,6 @@ static void NvFuserScheduler_BatchNorm_nhwc( DataType dtype) { TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half); - const bool kTraining = true; - const float kMomentum = 0.1; - const float kEps = 1e-5; - std::vector input_shape{ benchmark_state.range(0), benchmark_state.range(2), diff --git a/benchmarks/cpp/nvfuser/batch_norm_channels_last_backward.cpp b/benchmarks/cpp/nvfuser/batch_norm_channels_last_backward.cpp index 0660b75e39426..29bcfb3e81be7 100644 --- a/benchmarks/cpp/nvfuser/batch_norm_channels_last_backward.cpp +++ b/benchmarks/cpp/nvfuser/batch_norm_channels_last_backward.cpp @@ -25,7 +25,6 @@ static void setupBatchNorm_nhwc_BWD(Fusion* fusion, DataType dtype) { FusionGuard fg(fusion); const bool kTraining = true; - const float kMomentum = 0.1; const float kEps = 1e-5; // setup fusion @@ -86,9 +85,6 @@ static void 
NvFuserScheduler_BatchNorm_nhwc_BWD( DataType dtype) { TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half); - const bool kTraining = true; - const float kEps = 1e-5; - std::vector input_shape{ benchmark_state.range(0), benchmark_state.range(2), diff --git a/benchmarks/cpp/nvfuser/bert.cpp b/benchmarks/cpp/nvfuser/bert.cpp index 06bcece52c8f5..05f0f490abb2e 100644 --- a/benchmarks/cpp/nvfuser/bert.cpp +++ b/benchmarks/cpp/nvfuser/bert.cpp @@ -140,7 +140,7 @@ static void MagicScheduler_DivMaxSoftDropFwd( fe.compileFusion(&fusion); fe.setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; cg_outputs = fe.runFusion({t0, t1}, norm_params->lparams); @@ -148,7 +148,7 @@ static void MagicScheduler_DivMaxSoftDropFwd( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); int64_t bytes = 0; for (auto tensor : std::vector({t0, t1})) { @@ -200,7 +200,7 @@ static void MagicScheduler_DivMaxSoftDropBwd( fe.compileFusion(&fusion); fe.setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; cg_outputs = fe.runFusion({t0, t1, t2, t3}, norm_params->lparams); @@ -208,7 +208,7 @@ static void MagicScheduler_DivMaxSoftDropBwd( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); int64_t bytes = 0; // Some reason t1 isn't used, ignore it. @@ -316,7 +316,7 @@ static void MagicScheduler_BiasDropoutAddLayernormFwd( fe.setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; cg_outputs = fe.runFusion(at_inputs, norm_params->lparams); @@ -324,7 +324,7 @@ static void MagicScheduler_BiasDropoutAddLayernormFwd( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); int64_t bytes = 0; for (auto inp : at_inputs) { @@ -426,7 +426,7 @@ static void MagicScheduler_BiasDropoutAddLayernormBwd1( fe.setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { clearL2Cache(); cg_outputs = fe.runFusion(at_inputs, norm_params->lparams); @@ -434,7 +434,7 @@ static void MagicScheduler_BiasDropoutAddLayernormBwd1( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); int64_t bytes = 0; for (auto inp : at_inputs) { @@ -537,7 +537,7 @@ static void MagicScheduler_BiasDropoutAddLayernormBwd2( fe.setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; cg_outputs = fe.runFusion(at_inputs, norm_params->lparams); @@ -545,7 +545,7 @@ static void MagicScheduler_BiasDropoutAddLayernormBwd2( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. 
- cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); int64_t bytes = 0; for (auto inp : at_inputs) { @@ -628,7 +628,7 @@ static void MagicScheduler_BiasDropoutAddLayernormBwd3( fe.setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; cg_outputs = fe.runFusion(at_inputs, norm_params->lparams); @@ -636,7 +636,7 @@ static void MagicScheduler_BiasDropoutAddLayernormBwd3( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); int64_t bytes = 0; for (auto inp : at_inputs) { diff --git a/benchmarks/cpp/nvfuser/broadcast.cpp b/benchmarks/cpp/nvfuser/broadcast.cpp index 05e8e052f4b26..04b6b18bd6b74 100644 --- a/benchmarks/cpp/nvfuser/broadcast.cpp +++ b/benchmarks/cpp/nvfuser/broadcast.cpp @@ -77,7 +77,7 @@ static void NvFuserScheduler_Broadcast( fusion_executor_cache->profile(false); executor_instance->setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { clearL2Cache(); auto cg_outputs = fusion_executor_cache->runFusionWithInputs({t0, t1}); @@ -86,7 +86,7 @@ static void NvFuserScheduler_Broadcast( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); benchmark_state.SetBytesProcessed( int64_t(benchmark_state.iterations()) * @@ -112,14 +112,14 @@ static void Baseline_Broadcast( // Sync everything up before we start clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; auto output = t0.add(t1.unsqueeze(bcast_dim)); benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } benchmark_state.SetBytesProcessed( diff --git a/benchmarks/cpp/nvfuser/gelu_backward.cpp b/benchmarks/cpp/nvfuser/gelu_backward.cpp index 6632ba58a2365..732ad7f0ea0fd 100644 --- a/benchmarks/cpp/nvfuser/gelu_backward.cpp +++ b/benchmarks/cpp/nvfuser/gelu_backward.cpp @@ -113,9 +113,6 @@ BENCHMARK(GeluBackward_AutoSchedule)->Unit(benchmark::kMicrosecond); //------------------------------------------------------------------------------ static void GeluBackward_Lower(benchmark::State& benchmark_state) { - constexpr int kHiddenFeatures = 512; - constexpr int kBatchSize = 64; - Fusion fusion; // setup fusion @@ -173,11 +170,11 @@ static void GeluBackward_RunFusion(benchmark::State& benchmark_state) { FusionExecutor executor; executor.compileFusion(&fusion); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { outputs = executor.runFusion(c10::ArrayRef(inputs), lparams); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); clearL2Cache(); } } @@ -204,7 +201,7 @@ static void GeluBackward_RunFusion_GpuOnly(benchmark::State& benchmark_state) { executor.setMeasureKernelTimeFlag(true); executor.compileFusion(&fusion); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { outputs = executor.runFusion(c10::ArrayRef(inputs), lparams); diff --git 
a/benchmarks/cpp/nvfuser/heuristic_lookup.cpp b/benchmarks/cpp/nvfuser/heuristic_lookup.cpp index 64b1ecfb756d4..3bd4ec0b1607d 100644 --- a/benchmarks/cpp/nvfuser/heuristic_lookup.cpp +++ b/benchmarks/cpp/nvfuser/heuristic_lookup.cpp @@ -99,12 +99,15 @@ static void LayerNormBackward_HeuristicLookup( auto runtime = getLayerBackwardNormRuntime( std::move(fusion_ptr), fec, aten_inputs, shape, norm_shape); + + KernelArgumentHolder args = KernelArgumentHolder::createKernelArgumentHolder(aten_inputs); + TORCH_INTERNAL_ASSERT( - runtime->getMaybeHeuristicsFor(aten_inputs).has_value()); + runtime->getMaybeHeuristicsFor(args).has_value()); for (auto _ : benchmark_state) { // Setup (not included in the measurement) - runtime->getMaybeHeuristicsFor(aten_inputs); + runtime->getMaybeHeuristicsFor(args); } } @@ -152,12 +155,15 @@ static void LayerNormForward_HeuristicLookup( auto runtime = getLayerForwardNormRuntime( std::move(fusion_ptr), fec, aten_inputs, shape, norm_shape); + + KernelArgumentHolder args = KernelArgumentHolder::createKernelArgumentHolder(aten_inputs); + TORCH_INTERNAL_ASSERT( - runtime->getMaybeHeuristicsFor(aten_inputs).has_value()); + runtime->getMaybeHeuristicsFor(args).has_value()); for (auto _ : benchmark_state) { // Setup (not included in the measurement) - runtime->getMaybeHeuristicsFor(aten_inputs); + runtime->getMaybeHeuristicsFor(args); } } diff --git a/benchmarks/cpp/nvfuser/instance_norm.cpp b/benchmarks/cpp/nvfuser/instance_norm.cpp index a7139c113a43b..05475f1144743 100644 --- a/benchmarks/cpp/nvfuser/instance_norm.cpp +++ b/benchmarks/cpp/nvfuser/instance_norm.cpp @@ -165,7 +165,7 @@ static void Baseline_InstanceNorm( auto ato_running_var = c10::optional(at_var); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; @@ -182,9 +182,9 @@ static void Baseline_InstanceNorm( auto output = at::relu(norm); benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } const size_t kChannels = benchmark_state.range(2); diff --git a/benchmarks/cpp/nvfuser/layer_norm.cpp b/benchmarks/cpp/nvfuser/layer_norm.cpp index d793a45caa3c0..d2cff09e5d2ed 100644 --- a/benchmarks/cpp/nvfuser/layer_norm.cpp +++ b/benchmarks/cpp/nvfuser/layer_norm.cpp @@ -22,7 +22,6 @@ static void setupLayerNorm(Fusion* fusion, DataType dtype) { FusionGuard fg(fusion); - const int kReductionAxis = 1; const float kEps = 1e-5; Double* eps_ptr = IrBuilder::create(kEps); @@ -61,7 +60,6 @@ static void NvFuserScheduler_LayerNorm( std::vector input_shape{ benchmark_state.range(0), benchmark_state.range(1)}; - const float kEps = 1e-5; // inputs at::manual_seed(0); @@ -105,14 +103,14 @@ static void Baseline_LayerNorm( at::Tensor bias = at::randn({input_shape[1]}, options); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; auto output = at::layer_norm(input, norm_shape, weight, bias); benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } benchmark_state.SetBytesProcessed( diff --git a/benchmarks/cpp/nvfuser/layer_norm_backward.cpp b/benchmarks/cpp/nvfuser/layer_norm_backward.cpp index 9e6ac1c207d1d..c431622e7b9f4 100644 --- 
a/benchmarks/cpp/nvfuser/layer_norm_backward.cpp +++ b/benchmarks/cpp/nvfuser/layer_norm_backward.cpp @@ -22,9 +22,6 @@ static void setupLayerNorm_BWD(Fusion* fusion, DataType dtype) { TORCH_INTERNAL_ASSERT(dtype == DataType::Float || dtype == DataType::Half); - const int kReductionAxis = 1; - Double* eps_ptr = IrBuilder::create(1e-5); - // setup fusion auto grad_out = makeContigTensor(2, dtype); auto input = makeContigTensor(2, dtype); @@ -136,7 +133,7 @@ static void Baseline_LayerNorm_BWD( std::array output_mask = {true, true, true}; clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; at::native_layer_norm_backward( @@ -144,9 +141,9 @@ static void Baseline_LayerNorm_BWD( auto output = at::layer_norm(input, norm_shape, weight, bias); benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } benchmark_state.SetBytesProcessed( diff --git a/benchmarks/cpp/nvfuser/lstm_cell.cpp b/benchmarks/cpp/nvfuser/lstm_cell.cpp index 20ec7c8f47003..58fc057bd85fb 100644 --- a/benchmarks/cpp/nvfuser/lstm_cell.cpp +++ b/benchmarks/cpp/nvfuser/lstm_cell.cpp @@ -170,11 +170,11 @@ static void LstmCell_RunFusion( FusionExecutor executor; executor.compileFusion(&fusion); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { outputs = executor.runFusion(c10::ArrayRef(inputs), lparams); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } } diff --git a/benchmarks/cpp/nvfuser/matmul.cpp b/benchmarks/cpp/nvfuser/matmul.cpp new file mode 100644 index 0000000000000..25fc6cfe23569 --- /dev/null +++ b/benchmarks/cpp/nvfuser/matmul.cpp @@ -0,0 +1,357 @@ +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include + +#include + +using namespace torch::jit::fuser::cuda; + +bool cudaArchGuardShouldSkip(int required_major, int required_minor) { + int capability_major = at::cuda::getCurrentDeviceProperties()->major; + int capability_minor = at::cuda::getCurrentDeviceProperties()->minor; + + if (capability_major < required_major || + (capability_major == required_major && + capability_minor < required_minor)) { + return true; + } + return false; +} + +bool hasRequiredSmemSize(size_t required_size) { + // Only checking device 0 + return at::cuda::getDeviceProperties(0)->sharedMemPerBlockOptin >= + required_size; +} + +#define NVFUSER_BENCHMARK_ARCH_SMEM_GUARD( \ + REQUIRED_MAJOR, REQUIRED_MINOR, SMEM_SIZE, STATE) \ + if (cudaArchGuardShouldSkip(REQUIRED_MAJOR, REQUIRED_MINOR) || \ + !hasRequiredSmemSize(SMEM_SIZE)) { \ + STATE.SkipWithError("Unsupported arch or not enough smem!"); \ + return; \ + } + +// util to track support matmul operand layout. +using MatmulLayout = MmaOptions::MmaInputLayout; + +static constexpr std::array kAllSupportedLayout = { + MatmulLayout::TT, + MatmulLayout::NT, + MatmulLayout::TN}; + +// Generic interface to get matmul op with the given layout. 
+TensorView* matmul(TensorView* a, TensorView* b, MatmulLayout layout) { + TORCH_CHECK( + a->nDims() == 2 && b->nDims() == 2, "only pure matmuls for these tests"); + TensorView *tv2 = nullptr, *tv0b = nullptr, *tv1b = nullptr; + switch (layout) { + case MatmulLayout::TT: + tv0b = broadcast(a, {false, false, true}); + tv1b = broadcast(b, {true, false, false}); + tv2 = fusedMultiplySum(tv0b, tv1b, {1}); + break; + case MatmulLayout::TN: + tv0b = broadcast(a, {false, true, false}); + tv1b = broadcast(b, {true, false, false}); + tv2 = fusedMultiplySum(tv0b, tv1b, {2}); + break; + case MatmulLayout::NT: + tv0b = broadcast(a, {false, false, true}); + tv1b = broadcast(b, {false, true, false}); + tv2 = fusedMultiplySum(tv0b, tv1b, {0}); + break; + default: + TORCH_CHECK(false, "unsupported data layout."); + } + return tv2; +} + +// Utility to generate matmul input tensors based on given layout +at::Tensor atMatmul(at::Tensor a, at::Tensor b, MatmulLayout layout) { + switch (layout) { + case MatmulLayout::TT: + return a.matmul(b); + case MatmulLayout::TN: + return a.matmul(b.t()); + case MatmulLayout::NT: + return a.t().matmul(b); + default: + TORCH_CHECK(false, "unsupported data layout."); + } + return at::Tensor(); +} + +// Utility to generate reference results based on given layout +std::pair fp16MatmulAtInput( + int M, + int N, + int K, + MatmulLayout layout) { + auto options = at::TensorOptions().dtype(at::kHalf).device(at::kCUDA, 0); + + switch (layout) { + case MatmulLayout::TT: + return std::make_pair( + at::randn({M, K}, options), at::randn({K, N}, options)); + case MatmulLayout::TN: + return std::make_pair( + at::randn({M, K}, options), at::randn({N, K}, options)); + case MatmulLayout::NT: + return std::make_pair( + at::randn({K, M}, options), at::randn({K, N}, options)); + default: + TORCH_CHECK(false, "unsupported data layout."); + } + return std::make_pair(at::Tensor(), at::Tensor()); +} + +// TODO: separate compute and schedule definition once the can schedule +// logic and pattern matching is ready. +void setupMatmul(Fusion* fusion, MatmulLayout layout, MatmulParam params) { + // Only hgemm on the initial setup + auto a = makeContigTensor(2, DataType::Half); + auto b = makeContigTensor(2, DataType::Half); + + auto c = matmul(a, b, layout); + + fusion->addInput(a); + fusion->addInput(b); + fusion->addOutput(c); + + scheduleMatmul(c, a, b, params); +} + +static void SingleMatmulBase( + benchmark::State& benchmark_state, + MatmulLayout layout, + MatmulParam params) { + std::vector input_mnk{ + benchmark_state.range(0), + benchmark_state.range(1), + benchmark_state.range(2)}; + + auto fusion_ptr = std::make_unique(); + auto fusion = fusion_ptr.get(); + FusionGuard fg(fusion); + + // Define fusion graph + setupMatmul(fusion, layout, params); + + // inputs + at::manual_seed(0); + + // Tensor inputs + auto inputs = fp16MatmulAtInput( + input_mnk.at(0), input_mnk.at(1), input_mnk.at(2), layout); + + KernelArgumentHolder args = KernelArgumentHolder::createKernelArgumentHolder( + {inputs.first, inputs.second}); + + // Always use 32b indexing mode for now. 
+ TORCH_INTERNAL_ASSERT(args.getIndexMode() == KernelIndexMode::INT32); + + // Compile kernel + FusionExecutor fe; + fe.compileFusion(fusion, args, LaunchParams()); + + // Warm up run + auto outputs = fe.runFusion({inputs.first, inputs.second}); + fe.setMeasureKernelTimeFlag(true); + + // Sync everything up before we start + for (auto _ : benchmark_state) { + clearL2Cache(); + auto outputs = fe.runFusion({inputs.first, inputs.second}); + benchmark_state.SetIterationTime(fe.kernelTimeMs() / 1000.0); + } + // Sync everything up before we're finished, don't want to run ahead on the + // cpu while benchmarking. + cudaDeviceSynchronize(); + + // TODO: FLOPS calculation +} + +static void EagerModeMatmul( + benchmark::State& benchmark_state, + MatmulLayout layout) { + std::vector input_mnk{ + benchmark_state.range(0), + benchmark_state.range(1), + benchmark_state.range(2)}; + + at::manual_seed(0); + + auto inputs = fp16MatmulAtInput( + input_mnk.at(0), input_mnk.at(1), input_mnk.at(2), layout); + + // warm up run + auto outputs = atMatmul(inputs.first, inputs.second, layout); + + for (auto _ : benchmark_state) { + clearL2Cache(); + CudaKernelTimer timer; + outputs = atMatmul(inputs.first, inputs.second, layout); + benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); + } + // Sync everything up before we're finished, don't want to run ahead on the + // cpu while benchmarking. + cudaDeviceSynchronize(); +} + +// Actual benchmarking +// ----------------------------------------------------------------- + +size_t getSmemSize(GemmTile cta_tile, int stage_number) { + return ((cta_tile.m * cta_tile.k) + (cta_tile.n * cta_tile.k)) * + dataTypeSize(DataType::Half) * stage_number; +} + +// TODO: this part eventually will be automated by heuristics +MatmulParam getMatmulParams( + GemmTile cta_tile, + int stage_number, + MatmulLayout layout) { + MatMulTileOptions gemm_tile; + gemm_tile.cta_tile = cta_tile; + // TODO: pipe through split K + gemm_tile.warp_tile = GemmTile(64, 64, cta_tile.k); + gemm_tile.instruction_tile = GemmTile(16, 16, 16); + + // Collect mma swizzle info + auto mma_builder = + MmaBuilder(MmaOptions::MacroType::Ampere_16_16_16, gemm_tile) + .layout(layout); + + MatmulParam params(mma_builder); + params.tile_sizes = gemm_tile; + params.async_gmem_load_operands = true; + params.double_buffer_options.double_buffer_smem_write = true; + params.double_buffer_options.double_buffer_smem_read = true; + params.double_buffer_options.smem_double_buffer_stage = stage_number; + + return params; +} + +static void Nvfuser_Matmul_4warp3stage( + benchmark::State& benchmark_state, + MatmulLayout layout) { + auto cta_tile = GemmTile(128, 128, 32); + int number_of_stage = 3; + + auto params = getMatmulParams(cta_tile, number_of_stage, layout); + + NVFUSER_BENCHMARK_ARCH_SMEM_GUARD( + 8, 0, getSmemSize(cta_tile, number_of_stage), benchmark_state); + + // Run benchmark: + SingleMatmulBase(benchmark_state, layout, params); +} + +static void Nvfuser_Matmul_8warp3stage( + benchmark::State& benchmark_state, + MatmulLayout layout) { + auto cta_tile = GemmTile(256, 128, 32); + int number_of_stage = 3; + + auto params = getMatmulParams(cta_tile, number_of_stage, layout); + + NVFUSER_BENCHMARK_ARCH_SMEM_GUARD( + 8, 0, getSmemSize(cta_tile, number_of_stage), benchmark_state); + + // Run benchmark: + SingleMatmulBase(benchmark_state, layout, params); +} + +static void Nvfuser_Matmul_4warp4stage( + benchmark::State& benchmark_state, + MatmulLayout layout) { + auto cta_tile = GemmTile(128, 128, 32); + int 
number_of_stage = 4; + + auto params = getMatmulParams(cta_tile, number_of_stage, layout); + + NVFUSER_BENCHMARK_ARCH_SMEM_GUARD( + 8, 0, getSmemSize(cta_tile, number_of_stage), benchmark_state); + + // Run benchmark: + SingleMatmulBase(benchmark_state, layout, params); +} + +static void Nvfuser_Matmul_8warp4stage( + benchmark::State& benchmark_state, + MatmulLayout layout) { + auto cta_tile = GemmTile(256, 128, 32); + int number_of_stage = 4; + + auto params = getMatmulParams(cta_tile, number_of_stage, layout); + + NVFUSER_BENCHMARK_ARCH_SMEM_GUARD( + 8, 0, getSmemSize(cta_tile, number_of_stage), benchmark_state); + + // Run benchmark: + SingleMatmulBase(benchmark_state, layout, params); +} + +// ----------------------------- Benchmark Instantiation------- + +// Common utils: +#define NO_TILE_QUANTIZATION_ARGS \ + ArgsProduct( \ + {{2048}, {3456}, benchmark::CreateDenseRange(512, 4096, /*step=*/512)}) \ + ->Unit(benchmark::kMicrosecond) \ + ->UseManualTime(); + +#define ForAllLayouts(run) \ + run(TT, MatmulLayout::TT); \ + run(TN, MatmulLayout::TN); \ + run(NT, MatmulLayout::NT) + +// Instantiations: +#define Nvfuser_4warp3stage_test(layout_label, layout) \ + BENCHMARK_CAPTURE( \ + Nvfuser_Matmul_4warp3stage, \ + no_quant_nvfuser_4warp_##layout_label, \ + layout) \ + ->NO_TILE_QUANTIZATION_ARGS + +#define Nvfuser_8warp3stage_test(layout_label, layout) \ + BENCHMARK_CAPTURE( \ + Nvfuser_Matmul_8warp3stage, \ + no_quant_nvfuser_8warp_##layout_label, \ + layout) \ + ->NO_TILE_QUANTIZATION_ARGS + +#define Nvfuser_4warp4stage_test(layout_label, layout) \ + BENCHMARK_CAPTURE( \ + Nvfuser_Matmul_4warp4stage, \ + no_quant_nvfuser_4warp_##layout_label, \ + layout) \ + ->NO_TILE_QUANTIZATION_ARGS + +#define Nvfuser_8warp4stage_test(layout_label, layout) \ + BENCHMARK_CAPTURE( \ + Nvfuser_Matmul_8warp4stage, \ + no_quant_nvfuser_8warp_##layout_label, \ + layout) \ + ->NO_TILE_QUANTIZATION_ARGS + +#define Eagermode_test(layout_label, layout) \ + BENCHMARK_CAPTURE( \ + EagerModeMatmul, no_quant_eagermode_##layout_label, layout) \ + ->NO_TILE_QUANTIZATION_ARGS + +ForAllLayouts(Nvfuser_4warp3stage_test); +ForAllLayouts(Nvfuser_4warp4stage_test); +ForAllLayouts(Nvfuser_8warp3stage_test); +ForAllLayouts(Nvfuser_8warp4stage_test); +ForAllLayouts(Eagermode_test); diff --git a/benchmarks/cpp/nvfuser/reduction.cpp b/benchmarks/cpp/nvfuser/reduction.cpp index d6fc0ca327ae7..c4aaaf8a60475 100644 --- a/benchmarks/cpp/nvfuser/reduction.cpp +++ b/benchmarks/cpp/nvfuser/reduction.cpp @@ -73,7 +73,7 @@ static void NvFuserScheduler_Reduction( fusion_executor_cache->profile(false); executor_instance->setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { clearL2Cache(); auto cg_outputs = fusion_executor_cache->runFusionWithInputs({aten_input}); @@ -82,7 +82,7 @@ static void NvFuserScheduler_Reduction( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. 
- cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); benchmark_state.SetBytesProcessed( int64_t(benchmark_state.iterations()) * @@ -105,14 +105,14 @@ static void Baseline_Reduction( // Sync everything up before we start clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; auto output = aten_input.sum({reduction_dim}); benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } benchmark_state.SetBytesProcessed( diff --git a/benchmarks/cpp/nvfuser/rms_norm.cpp b/benchmarks/cpp/nvfuser/rms_norm.cpp index 81fdf46cf8189..37911ea6b1fd2 100644 --- a/benchmarks/cpp/nvfuser/rms_norm.cpp +++ b/benchmarks/cpp/nvfuser/rms_norm.cpp @@ -24,7 +24,6 @@ static void setupRMSNorm(Fusion* fusion, DataType dtype) { FusionGuard fg(fusion); - const int kReductionAxis = 2; const float kEps = 1e-6; Double* eps_ptr = IrBuilder::create(kEps); @@ -61,7 +60,6 @@ static void NvFuserScheduler_RMSNorm( dtype == DataType::BFloat16); std::vector input_shape{8, benchmark_state.range(0), 1024}; - const float kEps = 1e-6; // inputs at::manual_seed(0); diff --git a/benchmarks/cpp/nvfuser/rms_norm_backward.cpp b/benchmarks/cpp/nvfuser/rms_norm_backward.cpp index b4c6ac413c758..987c3bf234fa2 100644 --- a/benchmarks/cpp/nvfuser/rms_norm_backward.cpp +++ b/benchmarks/cpp/nvfuser/rms_norm_backward.cpp @@ -24,9 +24,6 @@ static void setupRMSNorm_BWD(Fusion* fusion, DataType dtype) { dtype == DataType::Float || dtype == DataType::Half || dtype == DataType::BFloat16); - const int kReductionAxis = 2; - Double* eps_ptr = IrBuilder::create(1e-6); - // setup fusion auto grad_out = makeContigTensor(3, dtype); auto input = makeContigTensor(3, dtype); diff --git a/benchmarks/cpp/nvfuser/scale_bias_relu.cpp b/benchmarks/cpp/nvfuser/scale_bias_relu.cpp index 74dbb5324cbab..158d3668c2792 100644 --- a/benchmarks/cpp/nvfuser/scale_bias_relu.cpp +++ b/benchmarks/cpp/nvfuser/scale_bias_relu.cpp @@ -144,7 +144,7 @@ static void NvFuserScheduler_SBR( fusion_executor_cache->profile(false); executor_instance->setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { clearL2Cache(); auto cg_outputs = fusion_executor_cache->runFusionWithInputs(aten_inputs); @@ -153,7 +153,7 @@ static void NvFuserScheduler_SBR( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. 
- cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); const size_t size = input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3]; @@ -182,7 +182,7 @@ static void Baseline_SBR(benchmark::State& benchmark_state, DataType dtype) { at::Tensor at_bias = at::zeros(bcast_shape, options); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; @@ -191,9 +191,9 @@ static void Baseline_SBR(benchmark::State& benchmark_state, DataType dtype) { auto output = at::relu(bias); benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); clearL2Cache(); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } const size_t size = @@ -245,7 +245,7 @@ static void NvFuserScheduler_SBR_Norm( fusion_executor_cache->profile(false); executor_instance->setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { clearL2Cache(); auto cg_outputs = fusion_executor_cache->runFusionWithInputs(aten_inputs); @@ -255,7 +255,7 @@ static void NvFuserScheduler_SBR_Norm( // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); const size_t size = input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3]; @@ -286,7 +286,7 @@ static void Baseline_SBR_Norm( at::Tensor at_mean = at::zeros(bcast_shape, options); at::Tensor at_var = at::ones(bcast_shape, options); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { CudaKernelTimer timer; @@ -298,7 +298,7 @@ static void Baseline_SBR_Norm( auto output = at::relu(bias); benchmark_state.SetIterationTime(timer.elapsed() / 1000.0); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } const size_t size = diff --git a/benchmarks/cpp/nvfuser/shape_inference.cpp b/benchmarks/cpp/nvfuser/shape_inference.cpp index 2e5e23ed7442e..fd628a163abce 100644 --- a/benchmarks/cpp/nvfuser/shape_inference.cpp +++ b/benchmarks/cpp/nvfuser/shape_inference.cpp @@ -100,8 +100,11 @@ void LayerNormBackward_ShapeInference_Base( auto runtime = getLayerBackwardNormRuntime( std::move(fusion_ptr), fec, aten_inputs, shape, norm_shape); + + KernelArgumentHolder args = KernelArgumentHolder::createKernelArgumentHolder(aten_inputs); + TORCH_INTERNAL_ASSERT( - runtime->getMaybeHeuristicsFor(aten_inputs).has_value()); + runtime->getMaybeHeuristicsFor(args).has_value()); fec->profile(true); fec->disableKernelLaunch(); @@ -172,8 +175,10 @@ void LayerNormForward_ShapeInferenceBase( auto runtime = getLayerForwardNormRuntime( std::move(fusion_ptr), fec, aten_inputs, shape, norm_shape); + KernelArgumentHolder args = KernelArgumentHolder::createKernelArgumentHolder(aten_inputs); + TORCH_INTERNAL_ASSERT( - runtime->getMaybeHeuristicsFor(aten_inputs).has_value()); + runtime->getMaybeHeuristicsFor(args).has_value()); fec->profile(true); fec->disableKernelLaunch(); diff --git a/benchmarks/cpp/nvfuser/softmax.cpp b/benchmarks/cpp/nvfuser/softmax.cpp index 439e426220f87..350ccb301638f 100644 --- a/benchmarks/cpp/nvfuser/softmax.cpp +++ b/benchmarks/cpp/nvfuser/softmax.cpp @@ -107,7 +107,7 @@ static void Softmax_WarpReduceReference(benchmark::State& benchmark_state) { } // Sync everything up before we're finished, don't want to run ahead on the 
// cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); benchmark_state.SetBytesProcessed( int64_t(benchmark_state.iterations()) * @@ -162,7 +162,7 @@ static void Softmax_WarpReduce(benchmark::State& benchmark_state) { } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); benchmark_state.SetBytesProcessed( int64_t(benchmark_state.iterations()) * @@ -206,7 +206,7 @@ static void Baseline_Softmax( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); benchmark_state.SetBytesProcessed( int64_t(benchmark_state.iterations()) * diff --git a/benchmarks/cpp/nvfuser/softmax_backward.cpp b/benchmarks/cpp/nvfuser/softmax_backward.cpp index 8fb35083c6dc7..51696ede90cec 100644 --- a/benchmarks/cpp/nvfuser/softmax_backward.cpp +++ b/benchmarks/cpp/nvfuser/softmax_backward.cpp @@ -116,7 +116,7 @@ static void Baseline_Softmax_BWD( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); benchmark_state.SetBytesProcessed( int64_t(benchmark_state.iterations()) * @@ -177,13 +177,13 @@ NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Outer_fp32) NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Outer_fp32) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Outer_fp32) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); @@ -201,13 +201,13 @@ NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Outer_fp16) NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Outer_fp16) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Outer_fp16) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); @@ -225,13 +225,13 @@ NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Inner_fp32) NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Inner_fp32) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Inner_fp32) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); @@ -249,13 +249,13 @@ NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Inner_fp16) NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Inner_fp16) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); NVFUSER_BENCHMARK_RUN(NvFuserScheduler_Softmax_BWD_Inner_fp16) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) 
->Unit(benchmark::kMicrosecond) ->UseManualTime(); @@ -275,13 +275,13 @@ BENCHMARK(Baseline_Softmax_BWD_Outer_fp32) BENCHMARK(Baseline_Softmax_BWD_Outer_fp32) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); BENCHMARK(Baseline_Softmax_BWD_Outer_fp32) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); @@ -299,13 +299,13 @@ BENCHMARK(Baseline_Softmax_BWD_Outer_fp16) BENCHMARK(Baseline_Softmax_BWD_Outer_fp16) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); BENCHMARK(Baseline_Softmax_BWD_Outer_fp16) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); @@ -323,13 +323,13 @@ BENCHMARK(Baseline_Softmax_BWD_Inner_fp32) BENCHMARK(Baseline_Softmax_BWD_Inner_fp32) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); BENCHMARK(Baseline_Softmax_BWD_Inner_fp32) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); @@ -347,13 +347,13 @@ BENCHMARK(Baseline_Softmax_BWD_Inner_fp16) BENCHMARK(Baseline_Softmax_BWD_Inner_fp16) // ->RangeMultiplier(2) - ->Ranges({{32768, 32 * 1024 * 1024}, {2, 16}}) + ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); BENCHMARK(Baseline_Softmax_BWD_Inner_fp16) // ->RangeMultiplier(2) - ->Ranges({{2, 16}, {32768, 32 * 1024 * 1024}}) + ->Ranges({{2, 16}, {32768, 16 * 1024 * 1024}}) ->Unit(benchmark::kMicrosecond) ->UseManualTime(); diff --git a/benchmarks/cpp/nvfuser/softmax_dropout.cpp b/benchmarks/cpp/nvfuser/softmax_dropout.cpp index 48950373731c1..383d1d4bb9f4d 100644 --- a/benchmarks/cpp/nvfuser/softmax_dropout.cpp +++ b/benchmarks/cpp/nvfuser/softmax_dropout.cpp @@ -127,7 +127,7 @@ static void Baseline_Softmax_Dropout( at::Tensor attention_scores = at::randn(input_shape, options); at::Tensor at_y = at::randn(input_shape, options); - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { clearL2Cache(); @@ -144,7 +144,7 @@ static void Baseline_Softmax_Dropout( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. 
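Editor's note: these `Ranges` edits trim the largest swept size from `32 * 1024 * 1024` to `16 * 1024 * 1024` elements. For context, a minimal, self-contained Google Benchmark registration in the same style is sketched below; `BM_Placeholder` and its timed body are placeholders, not the nvFuser softmax kernels, but the `RangeMultiplier`/`Ranges`/`UseManualTime` plumbing mirrors the registrations above (each `{lo, hi}` pair is swept by powers of the multiplier and the cartesian product becomes the argument sets):

```
#include <benchmark/benchmark.h>
#include <chrono>
#include <vector>

static void BM_Placeholder(benchmark::State& state) {
  const int64_t reduction = state.range(0); // first swept dimension
  const int64_t batches = state.range(1);   // second swept dimension
  std::vector<float> buf(static_cast<size_t>(reduction));
  for (auto _ : state) {
    auto start = std::chrono::high_resolution_clock::now();
    for (int64_t b = 0; b < batches; ++b) {
      benchmark::DoNotOptimize(buf.data()); // placeholder work
    }
    auto end = std::chrono::high_resolution_clock::now();
    // UseManualTime() requires reporting each iteration's time in seconds.
    state.SetIterationTime(std::chrono::duration<double>(end - start).count());
  }
}

BENCHMARK(BM_Placeholder)
    ->RangeMultiplier(2)
    // Same shape as the patched registrations: upper bound of 16 * 1024 * 1024.
    ->Ranges({{32768, 16 * 1024 * 1024}, {2, 16}})
    ->Unit(benchmark::kMicrosecond)
    ->UseManualTime();

BENCHMARK_MAIN();
```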
- cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); // 5 dtype: attention_scores + attention_mask + attention_scores_out + // attention_probs_out + output diff --git a/benchmarks/cpp/nvfuser/timm.cpp b/benchmarks/cpp/nvfuser/timm.cpp index 013b609be6020..4669ff0ecabf6 100644 --- a/benchmarks/cpp/nvfuser/timm.cpp +++ b/benchmarks/cpp/nvfuser/timm.cpp @@ -115,7 +115,7 @@ static void setup_vit_base_patch16_224_bcast5(Fusion* fusion, void* null) { auto t6 = set(t5); auto t7 = broadcast(t6, bcast_pattern0); auto t8 = add(t4, t7); - auto t9 = randlike(t8); + auto t9 = rand_like(t8); auto d34 = sub(IrBuilder::create(1.0), IrBuilder::create(0.0)); auto t10 = lt(t9, d34); @@ -139,7 +139,6 @@ static void setup_vit_base_patch16_224_bcast5(Fusion* fusion, void* null) { auto t20 = sum(t37, {2}); auto t24 = broadcast(t20, bcast_pattern1); auto d95 = castOp(DataType::Double, t2->axis(2)->extent()); - auto d96 = mul(IrBuilder::create(1.0), d95); auto d105 = reciprocal(d95); auto t25 = mul(t24, d105); auto t26 = add(t25, IrBuilder::create(1e-6)); @@ -289,7 +288,7 @@ static void setup_vit_base_patch16_224_norm_inner3(Fusion* fusion, void* null) { auto t10 = broadcast(t9, {false, false, false, true}); auto t11 = reciprocal(t10); auto t12 = mul(t8, t11); - auto t13 = randlike(t12); + auto t13 = rand_like(t12); auto d79 = sub(IrBuilder::create(1), IrBuilder::create(0)); auto t14 = lt(t13, d79); auto t15 = castOp(DataType::Float, t14); @@ -320,8 +319,6 @@ static void NvFuserScheduler_TIMM_vit_base_patch16_224_norm_inner3( at::manual_seed(0); auto fp16_options = at::TensorOptions().dtype(at::kHalf).device(at::kCUDA, 0); - auto fp32_options = - at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0); auto t0 = at::randn(input_shape, fp16_options); @@ -367,7 +364,7 @@ static void setup_vit_base_patch16_224_bcast_outer6( auto t9 = add(IrBuilder::create(1), t8); auto t10 = mul(IrBuilder::create(0.5), t9); auto t11 = mul(t6, t10); - auto t12 = randlike(t11); + auto t12 = rand_like(t11); auto d66 = sub(IrBuilder::create(1), IrBuilder::create(0)); auto t13 = lt(t12, d66); auto t14 = castOp(DataType::Float, t13); @@ -456,7 +453,7 @@ static void setup_vit_base_patch16_224_bcast_inner6( auto t9 = add(IrBuilder::create(1), t8); auto t10 = mul(IrBuilder::create(0.5), t9); auto t11 = mul(t6, t10); - auto t12 = randlike(t11); + auto t12 = rand_like(t11); auto d66 = sub(IrBuilder::create(1), IrBuilder::create(0)); auto t13 = lt(t12, d66); auto t14 = castOp(DataType::Float, t13); diff --git a/benchmarks/cpp/nvfuser/utils.cpp b/benchmarks/cpp/nvfuser/utils.cpp index 3915f7d652989..0a13c57d10e19 100644 --- a/benchmarks/cpp/nvfuser/utils.cpp +++ b/benchmarks/cpp/nvfuser/utils.cpp @@ -6,7 +6,7 @@ using namespace torch::jit::fuser::cuda; -std::string toString(ReductionParams rparams) { +std::string toString(const ReductionParams& rparams) { std::stringstream ss; ss << (rparams.fastest_dim ? "Red On Fastest Dim // " : "Red On Slow Dim // ") << (rparams.persistent_kernel ? 
"Persistent Kernel // " : "") @@ -65,7 +65,7 @@ std::string toString(ReductionParams rparams) { return ss.str(); } -std::string toString(PointwiseParams params) { +std::string toString(const PointwiseParams& params) { std::stringstream ss; if (params.break_point) { ss << "2D Schedule at " << params.break_point << "/"; @@ -89,6 +89,15 @@ std::string toString(PointwiseParams params) { return ss.str(); } +std::string toString(const TransposeParams& params) { + std::stringstream ss; + ss << "Tile size: (" << params.tile_size1 << "," << params.tile_size2 + << ")/"; + ss << "Vectorize size: (" << params.vectorize_factor1 << "," + << params.vectorize_factor2 << ")"; + return ss.str(); +} + std::string toString(const std::shared_ptr& params) { auto rparams = std::dynamic_pointer_cast(params); if (rparams) { @@ -98,6 +107,10 @@ std::string toString(const std::shared_ptr& params) { if (pparams) { return toString(*pparams); } + auto tparams = std::dynamic_pointer_cast(params); + if (tparams) { + return toString(*tparams); + } TORCH_INTERNAL_ASSERT( false, "Unknown heuristic parameter type. Did you just added a new heuristic parameter type but forget to update here?"); @@ -176,7 +189,7 @@ void runBenchmarkIterations( executor_instance->setMeasureKernelTimeFlag(true); // Sync everything up before we start - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); for (auto _ : benchmark_state) { clearL2Cache(); auto cg_outputs = fusion_executor_cache->runFusionWithInputs(aten_inputs); @@ -185,7 +198,7 @@ void runBenchmarkIterations( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } else { // Segmented // Sync everything up before we start @@ -193,7 +206,7 @@ void runBenchmarkIterations( // Compile/warmup auto cg_outputs = fusion_executor_cache->runFusionWithInputs(aten_inputs); } - cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); CudaKernelTimer timer; for (auto _ : benchmark_state) { clearL2Cache(); @@ -203,7 +216,7 @@ void runBenchmarkIterations( } // Sync everything up before we're finished, don't want to run ahead on the // cpu while benchmarking. 
- cudaDeviceSynchronize(); + C10_CUDA_CHECK(cudaDeviceSynchronize()); } } diff --git a/benchmarks/cpp/nvfuser/utils.h b/benchmarks/cpp/nvfuser/utils.h index e24fdfb127dab..7bfc4aefd45c2 100644 --- a/benchmarks/cpp/nvfuser/utils.h +++ b/benchmarks/cpp/nvfuser/utils.h @@ -36,8 +36,9 @@ TensorView* makeContigConcreteTensor( std::vector shape, DataType dtype = DataType::Float); -std::string toString(ReductionParams rparams); -std::string toString(PointwiseParams params); +std::string toString(const ReductionParams& rparams); +std::string toString(const PointwiseParams& params); +std::string toString(const TransposeParams& params); std::string toString(const std::shared_ptr& params); std::string toString(LaunchParams lparams); @@ -55,26 +56,27 @@ class CudaKernelTimer { public: CudaKernelTimer() { // Setup - cudaEventCreate(&start_event); - cudaEventCreate(&finish_event); - cudaEventRecord(start_event); + C10_CUDA_CHECK(cudaEventCreate(&start_event)); + C10_CUDA_CHECK(cudaEventCreate(&finish_event)); + C10_CUDA_CHECK(cudaEventRecord(start_event)); } ~CudaKernelTimer() { - cudaEventDestroy(start_event); - cudaEventDestroy(finish_event); + C10_CUDA_IGNORE_ERROR(cudaEventDestroy(start_event)); + C10_CUDA_IGNORE_ERROR(cudaEventDestroy(finish_event)); } void restart() { - cudaEventRecord(start_event); + C10_CUDA_CHECK(cudaEventRecord(start_event)); } float elapsed() { // Record - cudaEventRecord(finish_event); - cudaEventSynchronize(start_event); - cudaEventSynchronize(finish_event); - cudaEventElapsedTime(&kernel_time_ms_, start_event, finish_event); + C10_CUDA_CHECK(cudaEventRecord(finish_event)); + C10_CUDA_CHECK(cudaEventSynchronize(start_event)); + C10_CUDA_CHECK(cudaEventSynchronize(finish_event)); + C10_CUDA_CHECK( + cudaEventElapsedTime(&kernel_time_ms_, start_event, finish_event)); return kernel_time_ms_; } diff --git a/benchmarks/distributed/ddp/benchmark.py b/benchmarks/distributed/ddp/benchmark.py index 2c742d0fc9d8f..a905ad60f5309 100644 --- a/benchmarks/distributed/ddp/benchmark.py +++ b/benchmarks/distributed/ddp/benchmark.py @@ -87,7 +87,7 @@ def run_benchmark(benchmark, ranks, opts): measurements = [] if dist.get_rank() in set(ranks): if not opts: - opts = dict() + opts = {} measurements = benchmark_process_group(group, benchmark, **opts) dist.destroy_process_group(group) dist.barrier() diff --git a/benchmarks/operator_benchmark/pt/qactivation_test.py b/benchmarks/operator_benchmark/pt/qactivation_test.py index 5baf4cca3c3b4..f57ff8d1f16c3 100644 --- a/benchmarks/operator_benchmark/pt/qactivation_test.py +++ b/benchmarks/operator_benchmark/pt/qactivation_test.py @@ -1,5 +1,5 @@ import torch -import torch.nn.quantized as nnq +import torch.ao.nn.quantized.functional as qF import operator_benchmark as op_bench @@ -44,9 +44,9 @@ attrs=( ('relu', torch.nn.ReLU()), ('relu6', torch.ops.quantized.relu6), - ('functional.hardtanh', nnq.functional.hardtanh), - ('functional.hardsigmoid', nnq.functional.hardsigmoid), - ('functional.leaky_relu', nnq.functional.leaky_relu), + ('functional.hardtanh', qF.hardtanh), + ('functional.hardsigmoid', qF.hardsigmoid), + ('functional.leaky_relu', qF.leaky_relu), ('functional.sigmoid', torch.nn.functional.sigmoid), ('functional.tanh', torch.nn.functional.tanh), ), @@ -92,9 +92,9 @@ def forward(self, q_input): qactivation_scale_zero_point_ops = op_bench.op_list( attrs=( - ('functional.hardswish', nnq.functional.hardswish), - ('functional.elu', nnq.functional.elu), - ('functional.celu', nnq.functional.celu), + ('functional.hardswish', qF.hardswish), + 
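Editor's note: `CudaKernelTimer` now checks every event call and deliberately ignores failures in the destructor. The stripped-down sketch below walks the same cudaEvent sequence (create, record, synchronize, elapsed, destroy) with a local `checkCuda` helper standing in for `C10_CUDA_CHECK`; it is an illustration, not the benchmark harness itself:

```
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Local stand-in for C10_CUDA_CHECK: fail loudly on any runtime error.
static void checkCuda(cudaError_t err) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    std::abort();
  }
}

int main() {
  cudaEvent_t start, finish;
  checkCuda(cudaEventCreate(&start));
  checkCuda(cudaEventCreate(&finish));

  checkCuda(cudaEventRecord(start));
  // ... the kernel launches being timed would go here ...
  checkCuda(cudaEventRecord(finish));

  // Both events must have completed before cudaEventElapsedTime is meaningful.
  checkCuda(cudaEventSynchronize(finish));
  float ms = 0.0f;
  checkCuda(cudaEventElapsedTime(&ms, start, finish));
  std::printf("kernel time: %f ms\n", ms);

  // Destruction failures are ignored in the patch (C10_CUDA_IGNORE_ERROR);
  // here we simply discard the return values.
  (void)cudaEventDestroy(start);
  (void)cudaEventDestroy(finish);
  return 0;
}
```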
('functional.elu', qF.elu), + ('functional.celu', qF.celu), ), attr_names=('op_name', 'op_func'), ) diff --git a/benchmarks/operator_benchmark/pt/qarithmetic_test.py b/benchmarks/operator_benchmark/pt/qarithmetic_test.py index b1103a8a25315..97766bdb4c194 100644 --- a/benchmarks/operator_benchmark/pt/qarithmetic_test.py +++ b/benchmarks/operator_benchmark/pt/qarithmetic_test.py @@ -29,7 +29,7 @@ class _QFunctionalBinaryArithmeticBenchmarkBase(op_bench.TorchBenchmarkBase): def setup(self, N, dtype, contig): - self.qfunctional = torch.nn.quantized.QFunctional() + self.qfunctional = torch.ao.nn.quantized.QFunctional() # TODO: Consider more diverse shapes f_input = (torch.rand(N, N) - 0.5) * 256 diff --git a/benchmarks/operator_benchmark/pt/qatembedding_ops_test.py b/benchmarks/operator_benchmark/pt/qatembedding_ops_test.py index 2dcfdd4de4399..97ce0357557e3 100644 --- a/benchmarks/operator_benchmark/pt/qatembedding_ops_test.py +++ b/benchmarks/operator_benchmark/pt/qatembedding_ops_test.py @@ -1,6 +1,6 @@ import operator_benchmark as op_bench import torch -import torch.nn.qat as nnqat +import torch.ao.nn.qat as nnqat import numpy from pt import configs from torch.ao.quantization import default_embedding_qat_qconfig diff --git a/benchmarks/operator_benchmark/pt/qcat_test.py b/benchmarks/operator_benchmark/pt/qcat_test.py index 32dd32e43adfe..2ff0b87a9d380 100644 --- a/benchmarks/operator_benchmark/pt/qcat_test.py +++ b/benchmarks/operator_benchmark/pt/qcat_test.py @@ -1,7 +1,7 @@ import operator_benchmark as op_bench import torch -import torch.nn.quantized as nnq +import torch.ao.nn.quantized as nnq from typing import List diff --git a/benchmarks/operator_benchmark/pt/qconv_test.py b/benchmarks/operator_benchmark/pt/qconv_test.py index 14e8e143a7ca8..c48759d330e78 100644 --- a/benchmarks/operator_benchmark/pt/qconv_test.py +++ b/benchmarks/operator_benchmark/pt/qconv_test.py @@ -1,7 +1,7 @@ import operator_benchmark as op_bench import torch -import torch.nn.quantized as nnq +import torch.ao.nn.quantized as nnq from pt import configs diff --git a/benchmarks/operator_benchmark/pt/qembeddingbag_test.py b/benchmarks/operator_benchmark/pt/qembeddingbag_test.py index 872f8c28fccd4..5a406631b5ed8 100644 --- a/benchmarks/operator_benchmark/pt/qembeddingbag_test.py +++ b/benchmarks/operator_benchmark/pt/qembeddingbag_test.py @@ -1,7 +1,7 @@ import operator_benchmark as op_bench import torch -import torch.nn.quantized as nnq +import torch.ao.nn.quantized as nnq import numpy from pt import configs diff --git a/benchmarks/operator_benchmark/pt/qlinear_test.py b/benchmarks/operator_benchmark/pt/qlinear_test.py index 6e4dd9d97eca5..c4f8f36c11d3b 100644 --- a/benchmarks/operator_benchmark/pt/qlinear_test.py +++ b/benchmarks/operator_benchmark/pt/qlinear_test.py @@ -2,8 +2,8 @@ import operator_benchmark as op_bench import torch -import torch.nn.quantized as nnq -import torch.nn.quantized.dynamic as nnqd +import torch.ao.nn.quantized as nnq +import torch.ao.nn.quantized.dynamic as nnqd from pt import configs diff --git a/benchmarks/operator_benchmark/pt/quantization_test.py b/benchmarks/operator_benchmark/pt/quantization_test.py index 8ffbdd20e4429..332ff52c21d6e 100644 --- a/benchmarks/operator_benchmark/pt/quantization_test.py +++ b/benchmarks/operator_benchmark/pt/quantization_test.py @@ -1,7 +1,7 @@ import operator_benchmark as op_bench import torch -import torch.nn.quantized as nnq +import torch.ao.nn.quantized as nnq import torch.ao.quantization as tq import torch.nn as nn diff --git 
a/benchmarks/static_runtime/test_static_runtime.cc b/benchmarks/static_runtime/test_static_runtime.cc index 72ee217401ab0..ee37ddeaf71ac 100644 --- a/benchmarks/static_runtime/test_static_runtime.cc +++ b/benchmarks/static_runtime/test_static_runtime.cc @@ -2164,7 +2164,12 @@ TEST(StaticRuntime, Permute) { c10::List dims_b{0, 2, 1}; std::vector args_b{b, dims_b}; + auto c = at::randn({3, 3, 3}); + c10::List dims_c{0, -1, 1}; + std::vector args_c{c, dims_c}; + testStaticRuntime(permute_script, args_a); + testStaticRuntime(permute_script, args_c); testStaticRuntime(permute_script, args_a, args_b); permute_script = R"JIT( @@ -2590,23 +2595,28 @@ TEST(StaticRuntime, JIT_Aten_Numel) { } TEST(StaticRuntime, JIT_Aten_List) { - const std::string script = R"IR( + const auto script_str = R"IR( graph(%a: str): - %1 : int = prim::Constant[value=0]() %ret: str[] = aten::list(%a) return (%ret) )IR"; - - auto graph = std::make_shared(); - std::unordered_map vmap; - vmap.reserve(0); - parseIR(script, graph.get(), vmap); - torch::jit::StaticModule smodule(graph); - - string a = "abcd"; + std::string a = "abcd"; std::vector args0{a}; + testStaticRuntime(script_str, args0); + + // Update the result of aten::list to ensure that a deep copy + // took place + const auto script_list = R"IR( + graph(%a : int[]): + %idx : int = prim::Constant[value=0]() + %value : int = prim::Constant[value=42]() + %res : int[] = aten::list(%a) + %updated : int[] = aten::_set_item(%res, %idx, %value) + return (%res, %a) + )IR"; - testStaticRuntime(script, args0); + std::vector args1{c10::List{1, 2, 3}}; + testStaticRuntime(script_list, args1); } TEST(StaticRuntime, JIT_Aten_Range_Length) { diff --git a/buckbuild.bzl b/buckbuild.bzl index 40f542e3f80df..ae1519ea8f5ee 100644 --- a/buckbuild.bzl +++ b/buckbuild.bzl @@ -125,8 +125,8 @@ THIRD_PARTY_LIBS = { "XNNPACK": ["//xplat/third-party/XNNPACK:XNNPACK", "//third_party:XNNPACK"], "clog": ["//xplat/third-party/clog:clog", "//third_party:clog"], "cpuinfo": ["//third-party/cpuinfo:cpuinfo", "//third_party:cpuinfo"], - "flatbuffers-api": ["//third-party/flatbuffers:flatbuffers-api", "//third_party:flatbuffers-api"], - "flatc": ["//third-party/flatbuffers:flatc", "//third_party:flatc"], + "flatbuffers-api": ["//third-party/flatbuffers/fbsource_namespace:flatbuffers-api", "//third_party:flatbuffers-api"], + "flatc": ["//third-party/flatbuffers/fbsource_namespace:flatc", "//third_party:flatc"], "fmt": ["//third-party/fmt:fmt", "//third_party:fmt"], "glog": ["//third-party/glog:glog", "//third_party:glog"], "gmock": ["//xplat/third-party/gmock:gtest", "//third_party:gmock"], @@ -739,7 +739,7 @@ def get_pt_operator_registry_dict( third_party("glog"), C10, ] + ([ROOT + ":torch_mobile_train"] if train else []) + - ([ROOT + ":torch_flatbuffer_all"] if enable_flatbuffer else []), + ([ROOT + ":flatbuffers_mobile"] if enable_flatbuffer else []), **kwargs ) @@ -1347,7 +1347,7 @@ def define_buck_targets( exported_preprocessor_flags = get_pt_preprocessor_flags(), visibility = ["PUBLIC"], exported_deps = [ - ":torch_flatbuffer_all", + ":flatbuffers_mobile", ":torch_mobile_core", ], ) @@ -1497,8 +1497,6 @@ def define_buck_targets( # "torch/csrc/jit/mobile/compatibility/runtime_compatibility.cpp", # "torch/csrc/jit/serialization/unpickler.cpp", "torch/csrc/jit/mobile/compatibility/model_compatibility.cpp", - "torch/csrc/jit/serialization/pickle.cpp", - "torch/csrc/jit/serialization/pickler.cpp", ], header_namespace = "", exported_headers = [ @@ -1635,7 +1633,6 @@ def define_buck_targets( compiler_flags 
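Editor's note: the added `Permute` case feeds `aten::permute` a negative dimension (`{0, -1, 1}`), which wraps from the end. As a reminder of those semantics, a tiny libtorch snippet is sketched below; it assumes a working libtorch build and is not part of the static-runtime test itself:

```
#include <torch/torch.h>
#include <iostream>

int main() {
  // Negative dims wrap: for a rank-3 tensor, -1 refers to dim 2,
  // so {0, -1, 1} is equivalent to {0, 2, 1}.
  auto c = torch::randn({3, 4, 5});
  auto p = c.permute({0, -1, 1});
  std::cout << p.sizes() << std::endl; // [3, 5, 4]
  return 0;
}
```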
= get_pt_compiler_flags() + ["-Wno-error"], exported_preprocessor_flags = get_pt_preprocessor_flags() + [ "-DUSE_KINETO", - "-DUSE_KINETO_UPDATED", # Need this otherwise USE_KINETO is undefed # for mobile "-DEDGE_PROFILER_USE_KINETO", @@ -1662,7 +1659,6 @@ def define_buck_targets( compiler_flags = get_pt_compiler_flags() + ["-Wno-error"], exported_preprocessor_flags = get_pt_preprocessor_flags() + [ "-DUSE_KINETO", - "-DUSE_KINETO_UPDATED", "-DEDGE_PROFILER_USE_KINETO", ], # @lint-ignore BUCKLINT link_whole @@ -1689,21 +1685,29 @@ def define_buck_targets( cmd = "$(exe {})".format(third_party("flatc")) + " --cpp --gen-mutable --scoped-enums -o ${OUT} ${SRCS}", default_outs = ["."], + visibility = [ + "{}:mobile_bytecode".format(ROOT), + ], ) + # Users of this target will need to add third_party("flatbuffers-api") as a + # dep. fb_xplat_cxx_library( name = "mobile_bytecode", header_namespace = "", exported_headers = { "torch/csrc/jit/serialization/mobile_bytecode_generated.h": ":mobile_bytecode_header[mobile_bytecode_generated.h]", }, - exported_deps = [ - third_party("flatbuffers-api"), + # Avoid leaking implementation details by only exposing this header to + # the internals of the loader/serializer layer. + visibility = [ + "{}:flatbuffer_loader".format(ROOT), + "{}:flatbuffer_serializer_mobile".format(ROOT), ], ) fb_xplat_cxx_library( - name = "flatbuffer_serializer", + name = "flatbuffers_serializer_mobile", srcs = ["torch/csrc/jit/serialization/flatbuffer_serializer.cpp"], exported_headers = [ "torch/csrc/jit/serialization/flatbuffer_serializer.h", @@ -1714,17 +1718,16 @@ def define_buck_targets( "-fexceptions", "-frtti", "-Wno-deprecated-declarations", - ], + ] + (["-DFB_XPLAT_BUILD"] if not IS_OSS else []), visibility = ["PUBLIC"], deps = [ + ":mobile_bytecode", ":torch_mobile_module", C10, + third_party("flatbuffers-api"), ], exported_deps = [ - ":flatbuffer_loader", - ":mobile_bytecode", ":torch_mobile_train", - third_party("flatbuffers-api"), ], ) @@ -1739,11 +1742,10 @@ def define_buck_targets( compiler_flags = get_pt_compiler_flags() + ["-Wno-error"], exported_preprocessor_flags = get_pt_preprocessor_flags() + [ "-DUSE_KINETO", - "-DUSE_KINETO_UPDATED", # Need this otherwise USE_KINETO is undefed # for mobile "-DEDGE_PROFILER_USE_KINETO", - ], + ] + (["-DFB_XPLAT_BUILD"] if not IS_OSS else []), extra_flags = { "fbandroid_compiler_flags": ["-frtti"], }, @@ -1758,16 +1760,18 @@ def define_buck_targets( "-Wl,--no-as-needed", ], visibility = ["PUBLIC"], - exported_deps = [ + deps = [ ":mobile_bytecode", - ":torch_mobile_deserialize", third_party("flatbuffers-api"), + ], + exported_deps = [ + ":torch_mobile_deserialize", C10, ], ) fb_xplat_cxx_library( - name = "flatbuffer_serializer_jit", + name = "flatbuffers_serializer_jit", srcs = ["torch/csrc/jit/serialization/flatbuffer_serializer_jit.cpp"], exported_headers = [ "torch/csrc/jit/serialization/flatbuffer_serializer_jit.h", @@ -1785,22 +1789,29 @@ def define_buck_targets( visibility = ["PUBLIC"], deps = [ ":flatbuffer_loader", - ":flatbuffer_serializer", - ":mobile_bytecode", + ":flatbuffers_serializer_mobile", ":torch_core", ":torch_mobile_module", - third_party("flatbuffers-api"), C10, ], ) fb_xplat_cxx_library( - name = "torch_flatbuffer_all", + name = "flatbuffers_jit", + visibility = ["PUBLIC"], + exported_deps = [ + ":flatbuffer_loader", + ":flatbuffers_serializer_mobile", + ":flatbuffers_serializer_jit", + ], + ) + + fb_xplat_cxx_library( + name = "flatbuffers_mobile", visibility = ["PUBLIC"], exported_deps = [ 
":flatbuffer_loader", - ":flatbuffer_serializer", - ":flatbuffer_serializer_jit", + ":flatbuffers_serializer_mobile", ], ) diff --git a/build.bzl b/build.bzl index ac9ceaa0559de..5715e34786d45 100644 --- a/build.bzl +++ b/build.bzl @@ -92,6 +92,7 @@ def define_targets(rules): ":LazyIr.h", ":LazyNonNativeIr.h", ":RegisterDispatchKey.cpp", + ":RegisterDispatchDefinitions.ini", ":native_functions.yaml", ":shape_inference.h", ":tags.yaml", diff --git a/build_variables.bzl b/build_variables.bzl index e4b4b82df5f60..f70d4280825af 100644 --- a/build_variables.bzl +++ b/build_variables.bzl @@ -26,6 +26,8 @@ libtorch_nvfuser_runtime_sources = [ "torch/csrc/jit/codegen/cuda/runtime/broadcast.cu", "torch/csrc/jit/codegen/cuda/runtime/fp16_support.cu", "torch/csrc/jit/codegen/cuda/runtime/fused_reduction.cu", + "torch/csrc/jit/codegen/cuda/runtime/fused_welford_helper.cu", + "torch/csrc/jit/codegen/cuda/runtime/fused_welford_impl.cu", "torch/csrc/jit/codegen/cuda/runtime/grid_broadcast.cu", "torch/csrc/jit/codegen/cuda/runtime/grid_reduction.cu", "torch/csrc/jit/codegen/cuda/runtime/grid_sync.cu", @@ -102,7 +104,6 @@ core_sources_common = [ "torch/csrc/jit/frontend/edit_distance.cpp", "torch/csrc/jit/mobile/compatibility/runtime_compatibility.cpp", "torch/csrc/jit/mobile/type_parser.cpp", - "torch/csrc/jit/operator_upgraders/upgraders_guard.cpp", "torch/csrc/jit/operator_upgraders/version_map.cpp", "torch/csrc/jit/runtime/instruction.cpp", "torch/csrc/jit/runtime/jit_exception.cpp", @@ -130,6 +131,7 @@ libtorch_profiler_sources = [ "torch/csrc/autograd/profiler_kineto.cpp", "torch/csrc/profiler/api.cpp", "torch/csrc/profiler/collection.cpp", + "torch/csrc/profiler/execution_graph_observer.cpp", "torch/csrc/profiler/kineto_shim.cpp", "torch/csrc/profiler/nvtx_observer.cpp", "torch/csrc/profiler/kineto_client_interface.cpp", @@ -289,6 +291,7 @@ core_sources_full_mobile_no_backend_interface = [ "torch/csrc/jit/passes/utils/subgraph_utils.cpp", "torch/csrc/jit/passes/utils/optimization_utils.cpp", "torch/csrc/jit/passes/utils/op_registry.cpp", + "torch/csrc/jit/passes/mkldnn_rewrite.cpp", "torch/csrc/jit/passes/xnnpack_rewrite.cpp", "torch/csrc/jit/passes/vulkan_rewrite.cpp", "torch/csrc/jit/passes/metal_rewrite.cpp", @@ -553,6 +556,8 @@ torch_mobile_core = [ # TODO: Remove this dependency "torch/csrc/jit/backends/backend_debug_info.cpp", "torch/csrc/jit/mobile/compatibility/model_compatibility.cpp", + # TODO: This line needs to be uncommented to build mobile in OSS with flatbuffers + # "torch/csrc/jit/mobile/flatbuffer_loader.cpp", "torch/csrc/jit/mobile/function.cpp", "torch/csrc/jit/mobile/import.cpp", "torch/csrc/jit/mobile/interpreter.cpp", @@ -644,7 +649,7 @@ libtorch_cuda_core_sources = [ "torch/csrc/autograd/functions/comm.cpp", "torch/csrc/jit/codegen/cuda/arith.cpp", "torch/csrc/jit/codegen/cuda/compute_at.cpp", - "torch/csrc/jit/codegen/cuda/inline_propagator.cpp", + "torch/csrc/jit/codegen/cuda/inlining.cpp", "torch/csrc/jit/codegen/cuda/compute_at_map.cpp", "torch/csrc/jit/codegen/cuda/codegen.cpp", "torch/csrc/jit/codegen/cuda/contiguity.cpp", @@ -678,6 +683,7 @@ libtorch_cuda_core_sources = [ "torch/csrc/jit/codegen/cuda/lower_alias_memory.cpp", "torch/csrc/jit/codegen/cuda/lower_allocation.cpp", "torch/csrc/jit/codegen/cuda/lower_double_buffer.cpp", + "torch/csrc/jit/codegen/cuda/lower_divisible_split.cpp", "torch/csrc/jit/codegen/cuda/lower_expr_sort.cpp", "torch/csrc/jit/codegen/cuda/lower_fused_reduction.cpp", "torch/csrc/jit/codegen/cuda/lower_fusion_simplifier.cpp", @@ -718,12 +724,14 
@@ libtorch_cuda_core_sources = [ "torch/csrc/jit/codegen/cuda/root_domain_map.cpp", "torch/csrc/jit/codegen/cuda/scheduler/pointwise.cpp", "torch/csrc/jit/codegen/cuda/scheduler/pointwise_utils.cpp", + "torch/csrc/jit/codegen/cuda/scheduler/transpose.cpp", "torch/csrc/jit/codegen/cuda/scheduler/normalization.cpp", "torch/csrc/jit/codegen/cuda/scheduler/reduction.cpp", "torch/csrc/jit/codegen/cuda/scheduler/matmul.cpp", "torch/csrc/jit/codegen/cuda/scheduler/reduction_utils.cpp", "torch/csrc/jit/codegen/cuda/scheduler/registry.cpp", "torch/csrc/jit/codegen/cuda/scheduler/utils.cpp", + "torch/csrc/jit/codegen/cuda/scheduler/vectorize_helper.cpp", "torch/csrc/jit/codegen/cuda/type_inference.cpp", "torch/csrc/jit/codegen/cuda/type_promotion.cpp", "torch/csrc/jit/codegen/cuda/fusion_segmenter.cpp", @@ -897,7 +905,9 @@ libtorch_python_core_sources = [ "torch/csrc/jit/passes/onnx/shape_type_inference.cpp", "torch/csrc/jit/passes/onnx/function_extraction.cpp", "torch/csrc/jit/passes/onnx/onnx_log.cpp", + "torch/csrc/jit/passes/onnx/naming.cpp", "torch/csrc/jit/python/pybind_utils.cpp", + "torch/csrc/jit/passes/onnx/pattern_conversion/autograd_function_process.cpp", "torch/csrc/jit/passes/onnx/pattern_conversion/common.cpp", "torch/csrc/jit/passes/onnx/pattern_conversion/pattern_encapsulation.cpp", "torch/csrc/jit/passes/onnx/pattern_conversion/pattern_conversion.cpp", @@ -918,7 +928,7 @@ libtorch_python_core_sources = [ "torch/csrc/monitor/python_init.cpp", "torch/csrc/multiprocessing/init.cpp", "torch/csrc/onnx/init.cpp", - "torch/csrc/profiler/execution_graph_observer.cpp", + "torch/csrc/profiler/python/init.cpp", "torch/csrc/serialization.cpp", "torch/csrc/tensor/python_tensor.cpp", "torch/csrc/utils/init.cpp", @@ -1050,7 +1060,7 @@ aten_cpu_source_non_codegen_list = [ "aten/src/ATen/core/op_registration/infer_schema.cpp", "aten/src/ATen/core/op_registration/op_registration.cpp", "aten/src/ATen/core/operator_name.cpp", - "aten/src/ATen/core/TorchDispatchModeTLS.cpp", + "aten/src/ATen/core/TorchDispatchUtils.cpp", "aten/src/ATen/core/register_symbols.cpp", "aten/src/ATen/core/class_type.cpp", "aten/src/ATen/core/type.cpp", @@ -1069,6 +1079,7 @@ aten_cpu_source_non_codegen_list = [ "aten/src/ATen/native/UpSample.cpp", "aten/src/ATen/native/mkldnn/BinaryOps.cpp", "aten/src/ATen/native/mkldnn/Conv.cpp", + "aten/src/ATen/native/mkldnn/ConvPrepack.cpp", "aten/src/ATen/native/mkldnn/Copy.cpp", "aten/src/ATen/native/mkldnn/Gelu.cpp", "aten/src/ATen/native/mkldnn/IDeepRegistration.cpp", @@ -1077,8 +1088,10 @@ aten_cpu_source_non_codegen_list = [ "aten/src/ATen/native/mkldnn/MKLDNNConversions.cpp", "aten/src/ATen/native/mkldnn/MkldnnTensorMath.cpp", "aten/src/ATen/native/mkldnn/Normalization.cpp", + "aten/src/ATen/native/mkldnn/OpContext.cpp", "aten/src/ATen/native/mkldnn/Pooling.cpp", "aten/src/ATen/native/mkldnn/Prelu.cpp", + "aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp", "aten/src/ATen/native/mkldnn/Relu.cpp", "aten/src/ATen/native/mkldnn/SoftMax.cpp", "aten/src/ATen/native/mkldnn/TensorFactories.cpp", @@ -1096,9 +1109,6 @@ aten_cpu_source_non_codegen_list = [ "aten/src/ATen/Dispatch.cpp", "aten/src/ATen/SavedTensorHooks.cpp", "aten/src/ATen/vulkan/Context.cpp", - "aten/src/ATen/nnapi/nnapi_bind.cpp", - "aten/src/ATen/nnapi/nnapi_wrapper.cpp", - "aten/src/ATen/nnapi/nnapi_model_loader.cpp", "aten/src/ATen/native/prim_native_functions.cpp", "aten/src/ATen/native/verbose_wrapper.cpp", ] @@ -1397,7 +1407,6 @@ aten_native_source_non_codegen_list = [ # Files not in native, but depends on 
native symbols # "aten/src/ATen/TensorIndexing.cpp", "aten/src/ATen/TensorIterator.cpp", - "aten/src/ATen/nnapi/nnapi_register.cpp", ] # 1. Files in ATen/native with a few exceptions diff --git a/c10/core/DispatchKeySet.cpp b/c10/core/DispatchKeySet.cpp index 358703210112a..3cc564bc04ae2 100644 --- a/c10/core/DispatchKeySet.cpp +++ b/c10/core/DispatchKeySet.cpp @@ -50,7 +50,7 @@ constexpr DispatchKeySet math_dispatch_keyset = backend_dispatch_keyset | autograd_dispatch_keyset | // See Note [NestedTensor Not Included in Backend Keys] // The caveat to that note is that nested_tensor is a special case - // where we would like to support composite implict kernels but not + // where we would like to support composite implicit kernels but not // explicit kernels therefore we manually add the key to the // math_dispatch_keyset DispatchKeySet{DispatchKey::NestedTensor}; diff --git a/c10/core/SymInt.cpp b/c10/core/SymInt.cpp index 28e477481390a..944bc6722add0 100644 --- a/c10/core/SymInt.cpp +++ b/c10/core/SymInt.cpp @@ -4,7 +4,8 @@ namespace c10 { -std::array normalize_symints(SymInt a_, SymInt b_) { +#ifndef C10_MOBILE +static std::array normalize_symints(SymInt a_, SymInt b_) { SymIntNode a, b; if (a_.is_symbolic()) a = a_.toSymIntNodeImpl(); @@ -33,14 +34,38 @@ c10::SymInt SymInt::toSymInt(SymIntNode sin_sp) { auto ptr = static_cast( reinterpret_cast(static_cast(sin_sp.release()))); auto rep = (ptr & ~MASK) | IS_SYM; - return c10::SymInt(static_cast(rep)); + return c10::SymInt(UNCHECKED, static_cast(rep)); } +#else +// this code should never be executed on mobile due to inlining of `is_symbolic` +// which always returns `false` on mobile. +// However, if we decide to strip off `SymIntNode` completely from mobile builds +// We would need to stub these methods anyways +c10::SymInt SymInt::toSymInt(SymIntNode sin_sp) { + TORCH_INTERNAL_ASSERT(false, "SymInts aren't available on mobile"); +} +SymIntNode SymInt::toSymIntNodeImpl() const { + TORCH_INTERNAL_ASSERT(false, "SymInts aren't available on mobile"); +} +static std::array normalize_symints(SymInt a_, SymInt b_) { + TORCH_INTERNAL_ASSERT(false, "SymInts aren't available on mobile"); +} +#endif SymInt SymInt::operator+(SymInt sci) const { - TORCH_CHECK( - !this->is_symbolic() && !sci.is_symbolic(), - "Symbolic Add isn't supported yet"); - return SymInt(data_ + sci.data_); + if (!is_symbolic() && !sci.is_symbolic()) { + return SymInt(data_ + sci.data_); + } + auto res = normalize_symints(*this, sci); + return SymInt::toSymInt(res[0]->add(res[1])); +} + +SymInt SymInt::operator-(SymInt sci) const { + if (!is_symbolic() && !sci.is_symbolic()) { + return SymInt(data_ - sci.data_); + } + auto res = normalize_symints(*this, sci); + return SymInt::toSymInt(res[0]->sub(res[1])); } SymInt SymInt::operator*(SymInt sci) const { @@ -51,6 +76,22 @@ SymInt SymInt::operator*(SymInt sci) const { return SymInt::toSymInt(res[0]->mul(res[1])); } +SymInt SymInt::operator/(SymInt sci) const { + if (!is_symbolic() && !sci.is_symbolic()) { + return SymInt(data_ / sci.data_); + } + auto res = normalize_symints(*this, sci); + return SymInt::toSymInt(res[0]->floordiv(res[1])); +} + +SymInt SymInt::operator%(SymInt sci) const { + if (!is_symbolic() && !sci.is_symbolic()) { + return SymInt(data_ % sci.data_); + } + auto res = normalize_symints(*this, sci); + return SymInt::toSymInt(res[0]->mod(res[1])); +} + bool SymInt::operator==(SymInt sci) const { if (!is_symbolic() && !sci.is_symbolic()) { return data_ == sci.data_; @@ -64,22 +105,55 @@ bool SymInt::operator!=(SymInt 
sci) const { } bool SymInt::operator<(SymInt sci) const { - TORCH_CHECK( - !this->is_symbolic() && !sci.is_symbolic(), - "Symbolic lt isn't supported yet"); - return data_ < sci.data_; + if (!is_symbolic() && !sci.is_symbolic()) { + return data_ < sci.data_; + } + auto res = normalize_symints(*this, sci); + return res[0]->lt(res[1])->bool_(); +} + +bool SymInt::operator<=(SymInt sci) const { + if (!is_symbolic() && !sci.is_symbolic()) { + return data_ <= sci.data_; + } + auto res = normalize_symints(*this, sci); + return res[0]->le(res[1])->bool_(); +} + +bool SymInt::operator>(SymInt sci) const { + if (!is_symbolic() && !sci.is_symbolic()) { + return data_ > sci.data_; + } + auto res = normalize_symints(*this, sci); + return res[0]->gt(res[1])->bool_(); +} + +bool SymInt::operator>=(SymInt sci) const { + if (!is_symbolic() && !sci.is_symbolic()) { + return data_ >= sci.data_; + } + auto res = normalize_symints(*this, sci); + return res[0]->ge(res[1])->bool_(); } void SymInt::operator*=(SymInt sci) { - TORCH_CHECK( - !this->is_symbolic() && !sci.is_symbolic(), - "Symbolic mul_ isn't supported yet"); - data_ = data_ * sci.data_; + *this = *this * sci; } bool SymInt::operator<(int64_t sci) const { - TORCH_CHECK(!this->is_symbolic(), "Symbolic lt isn't supported yet"); - return data_ < sci; + return *this < c10::SymInt(sci); +} + +bool SymInt::operator<=(int64_t sci) const { + return *this <= c10::SymInt(sci); +} + +bool SymInt::operator>(int64_t sci) const { + return *this > c10::SymInt(sci); +} + +bool SymInt::operator>=(int64_t sci) const { + return *this >= c10::SymInt(sci); } bool SymInt::operator==(int64_t sci) const { @@ -91,8 +165,7 @@ bool SymInt::operator!=(int64_t sci) const { } SymInt SymInt::operator*(int64_t sci) const { - TORCH_CHECK(!this->is_symbolic(), "Symbolic mul isn't supported yet"); - return SymInt(data_ * sci); + return *this * c10::SymInt(sci); } } // namespace c10 diff --git a/c10/core/SymInt.h b/c10/core/SymInt.h index 331f10305dec0..015260bfaf309 100644 --- a/c10/core/SymInt.h +++ b/c10/core/SymInt.h @@ -27,11 +27,28 @@ namespace c10 { // SymInt will be extenteded to represent a union structure Union[int64_t, // SymIntNodeImpl*] which will be implemented as a single packed int64_t field // named data_. 
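Editor's note: all of the `SymInt` operators above share one shape: take the cheap concrete-integer path when neither operand is symbolic, otherwise normalize both sides to symbolic nodes and delegate to the virtual op. The toy `ToyInt`/`ToyNode` types below sketch that pattern only; the real `SymInt` packs the node pointer into the `int64_t` payload (as the comment above describes) rather than carrying a separate field:

```
#include <cstdint>
#include <iostream>
#include <memory>
#include <string>

// Toy symbolic node: just records an expression string.
struct ToyNode {
  std::string expr;
};

static std::shared_ptr<ToyNode> make_node(std::string expr) {
  return std::make_shared<ToyNode>(ToyNode{std::move(expr)});
}

// Toy "maybe symbolic" integer kept deliberately simple.
struct ToyInt {
  int64_t value;
  std::shared_ptr<ToyNode> node; // non-null => symbolic

  bool is_symbolic() const { return node != nullptr; }

  // Promote a concrete value so both operands speak "node".
  std::shared_ptr<ToyNode> as_node() const {
    return node ? node : make_node(std::to_string(value));
  }

  ToyInt operator+(const ToyInt& other) const {
    // Fast path: plain integer arithmetic when neither side is symbolic.
    if (!is_symbolic() && !other.is_symbolic()) {
      return ToyInt{value + other.value, nullptr};
    }
    // Slow path: normalize both operands and build a symbolic expression.
    return ToyInt{0, make_node("(" + as_node()->expr + " + " + other.as_node()->expr + ")")};
  }
};

int main() {
  ToyInt a{2, nullptr}, b{3, nullptr};
  ToyInt s{0, make_node("s")};
  std::cout << (a + b).value << "\n";      // 5 (concrete path)
  std::cout << (a + s).node->expr << "\n"; // (2 + s) (symbolic path)
  return 0;
}
```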
+ +#ifdef C10_MOBILE +#define SKIP_IS_SYMBOLIC_ON_MOBILE(_) \ + do { \ + } while (0) +#else +#define SKIP_IS_SYMBOLIC_ON_MOBILE(X) TORCH_CHECK(X) +#endif + class C10_API SymInt { + enum Unchecked { + UNCHECKED, + }; + public: - // TODO: this needs to only accept integers, not pointers - /*implicit*/ SymInt(int64_t d) : data_(d){}; - SymInt() = default; + /*implicit*/ SymInt(int64_t d) : data_(d) { + SKIP_IS_SYMBOLIC_ON_MOBILE(!is_symbolic()); + }; + SymInt() : data_(0) {} + + // unchecked c-tor accepting raw `data_` + SymInt(Unchecked, int64_t d) : data_(d) {} // TODO: these implementations are not optimal because they allocate a // temporary and then use the move constructor/assignment @@ -55,12 +72,14 @@ class C10_API SymInt { return *this; } SymInt& operator=(SymInt&& s) { + release_(); // release the current SymIntNode if any data_ = s.data_; if (s.is_symbolic()) s.data_ = 0; return *this; } +#ifndef C10_MOBILE SymIntNodeImpl* toSymIntNodeImplUnowned() const { uint64_t unextended_bits = static_cast(data_) & ~MASK; uint64_t sign_bit_mask = 1ULL << (62 - 1); @@ -70,35 +89,58 @@ class C10_API SymInt { reinterpret_cast(static_cast(extended_bits))); } - ~SymInt() { + void release_() { if (is_symbolic()) { SymIntNode::reclaim(toSymIntNodeImplUnowned()); // steal } } +#else + void release_() {} +#endif + + SymIntNode toSymIntNodeImpl() const; + static c10::SymInt toSymInt(SymIntNode sin); + + ~SymInt() { + release_(); + } int64_t expect_int() const { - TORCH_CHECK(!is_symbolic()); + SKIP_IS_SYMBOLIC_ON_MOBILE(!is_symbolic()); return data_; } - bool is_symbolic() const { + // N.B. It's important to keep this definition in the header + // as we expect if checks to be folded for mobile builds + // where `is_symbolic` is always false + C10_ALWAYS_INLINE bool is_symbolic() const { +#ifdef C10_MOBILE + return false; +#else return (MASK & static_cast(this->data_)) == IS_SYM; +#endif } SymInt operator+(SymInt sci) const; + SymInt operator-(SymInt sci) const; SymInt operator*(SymInt sci) const; + SymInt operator/(SymInt sci) const; + SymInt operator%(SymInt sci) const; bool operator==(SymInt sci) const; bool operator!=(SymInt p2) const; bool operator<(SymInt sci) const; + bool operator<=(SymInt sci) const; + bool operator>(SymInt sci) const; + bool operator>=(SymInt sci) const; void operator*=(SymInt sci); SymInt operator*(int64_t sci) const; bool operator<(int64_t sci) const; bool operator==(int64_t sci) const; bool operator!=(int64_t sci) const; - - SymIntNode toSymIntNodeImpl() const; - static c10::SymInt toSymInt(SymIntNode sin); + bool operator<=(int64_t sci) const; + bool operator>(int64_t sci) const; + bool operator>=(int64_t sci) const; int64_t as_int_unchecked() const { return data_; @@ -134,5 +176,7 @@ class C10_API SymInt { int64_t data_; }; +#undef SKIP_IS_SYMBOLIC_ON_MOBILE + C10_API std::ostream& operator<<(std::ostream& os, SymInt s); } // namespace c10 diff --git a/c10/core/SymIntArrayRef.h b/c10/core/SymIntArrayRef.h index bf2eb65c55366..6bfbc945ef91a 100644 --- a/c10/core/SymIntArrayRef.h +++ b/c10/core/SymIntArrayRef.h @@ -81,9 +81,9 @@ class SymIntArrayRef final { static SymIntArrayRef fromIntArrayRef(IntArrayRef array_ref) { for (size_t i = 0; i < array_ref.size(); ++i) { - TORCH_INTERNAL_ASSERT_DEBUG_ONLY( + TORCH_CHECK( SymInt::check_range(array_ref[i]), - "IntArrayRef contains int that cannot be representative as a SymInt", + "IntArrayRef contains an int that cannot be represented as a SymInt: ", array_ref[i]); } return SymIntArrayRef( diff --git a/c10/core/SymIntNodeImpl.h 
b/c10/core/SymIntNodeImpl.h index e5ffd2d5ef6a3..da4beaeae7dc7 100644 --- a/c10/core/SymIntNodeImpl.h +++ b/c10/core/SymIntNodeImpl.h @@ -33,7 +33,10 @@ class C10_API SymIntNodeImpl : public c10::intrusive_ptr_target { virtual SymIntNode mul(const SymIntNode& other) { TORCH_CHECK(false, "NYI"); }; - virtual SymIntNode div(const SymIntNode& other) { + virtual SymIntNode truediv(const SymIntNode& other) { + TORCH_CHECK(false, "FP division isn't support for SymInts"); + }; + virtual SymIntNode floordiv(const SymIntNode& other) { TORCH_CHECK(false, "NYI"); }; virtual SymIntNode mod(const SymIntNode& other) { diff --git a/c10/core/TensorImpl.cpp b/c10/core/TensorImpl.cpp index e2d8e9684e6f9..5d85e90138c7b 100644 --- a/c10/core/TensorImpl.cpp +++ b/c10/core/TensorImpl.cpp @@ -6,6 +6,7 @@ #include #include #include +#include #include #include @@ -181,7 +182,6 @@ TensorImpl::TensorImpl( if (!is_inference()) { version_counter_ = VariableVersion(/*version=*/0); } - // we would also like to check that non-cpu devices have an index, but some // Caffe2 operators create Storages with default devices. } @@ -209,16 +209,20 @@ void TensorImpl::HandleResize() { // If needed, we will free the data. the next mutable_data() call // will create the data storage. bool reset_tensor = false; + + TORCH_CHECK(!numel_.is_symbolic(), "CAFFE2 doesn't support SymInts"); + int concrete_numel = numel_.as_int_unchecked(); if (reserved_) { // If tensor is reserved then don't claim its memeory unless nbytes() // is smaller than new size - reset_tensor = - storage_.nbytes() < (storage_offset_ + numel_) * data_type_.itemsize(); + reset_tensor = storage_.nbytes() < + (storage_offset_ + concrete_numel) * data_type_.itemsize(); } else { reset_tensor = storage_.nbytes() < - (storage_offset_ + numel_) * data_type_.itemsize() || + (storage_offset_ + concrete_numel) * data_type_.itemsize() || !FLAGS_caffe2_keep_on_shrink || - storage_.nbytes() - (storage_offset_ + numel_) * data_type_.itemsize() > + storage_.nbytes() - + (storage_offset_ + concrete_numel) * data_type_.itemsize() > static_cast(FLAGS_caffe2_max_keep_on_shrink_memory); } @@ -419,6 +423,20 @@ c10::SymIntArrayRef TensorImpl::sym_sizes_custom() const { return sym_sizes_default(); } +c10::SymInt TensorImpl::sym_numel_custom() const { + if (C10_UNLIKELY(is_python_dispatch())) { + return load_pyobj_interpreter()->sym_numel(this); + } + return sym_numel_default(); +} + +c10::SymIntArrayRef TensorImpl::sym_strides_custom() const { + if (C10_UNLIKELY(is_python_dispatch())) { + return load_pyobj_interpreter()->sym_strides(this); + } + return sym_strides_default(); +} + c10::Device TensorImpl::device_custom() const { if (is_python_dispatch()) { return load_pyobj_interpreter()->device(this); @@ -526,17 +544,25 @@ template c10::intrusive_ptr TensorImpl::shallow_copy_and_detach_core( VariableVersion&& version_counter, bool allow_tensor_metadata_change) const { - if (key_set_.has(DispatchKey::Python) && + c10::intrusive_ptr r; + const auto& maybe_torch_dispatch_mode_state = + c10::impl::TorchDispatchModeTLS::get_state(); + // TODO: do we have to exclude after Python dispatch key set? + if (maybe_torch_dispatch_mode_state && !c10::impl::tls_is_dispatch_key_excluded(DispatchKey::Python)) { - auto r = pyobj_interpreter_.load(std::memory_order_acquire)->detach(this); - if (r) { - r->set_version_counter(std::forward(version_counter)); - r->set_allow_tensor_metadata_change(allow_tensor_metadata_change); - return r; - } - // otherwise just copy the TensorImpl and not the PyObject. 
Since - // the interpreter is dead no one can call us out on it + r = maybe_torch_dispatch_mode_state->pyinterpreter()->detach(this); + } else if ( + key_set_.has(DispatchKey::Python) && + !c10::impl::tls_is_dispatch_key_excluded(DispatchKey::Python)) { + r = pyobj_interpreter_.load(std::memory_order_acquire)->detach(this); } + if (r) { + r->set_version_counter(std::forward(version_counter)); + r->set_allow_tensor_metadata_change(allow_tensor_metadata_change); + return r; + } + // otherwise just copy the TensorImpl and not the PyObject. Since + // the interpreter is dead no one can call us out on it auto impl = c10::make_intrusive( // No need to populate Storage; copy_tensor_metadata will do it for us. key_set_, @@ -690,7 +716,7 @@ void TensorImpl::Extend(int64_t num, float growthPct) { sizes_and_strides_.size_at_unchecked(0).as_int_unchecked() * (1 + growthPct / 100)))); auto oldData = std::move(storage_.data_ptr()); - auto oldSize = numel_; + auto oldSize = numel_.as_int_unchecked(); Resize(newCapacity); auto* newData = raw_mutable_data(data_type_); if (data_type_.copy()) { @@ -726,7 +752,7 @@ void TensorImpl::ReserveSpace(int64_t outer_dim) { "Right now ReserveSpace is only supported for contiguous Tensor."); TORCH_CHECK( !has_symbolic_sizes_strides_, - "ReserveSpace() called on tensor with symbolic shape") + "ReserveSpace() called on tensor with symbolic shape"); TORCH_CHECK(storage_.unique(), "Can't call ReserveSpace on shared storage."); // TODO: eliminate newCapacity. @@ -758,7 +784,7 @@ void TensorImpl::Reshape(const std::vector& dims) { "Right now Reshape is only supported for contiguous Tensor."); TORCH_CHECK( !has_symbolic_sizes_strides_, - "Reshape() called on tensor with symbolic shape") + "Reshape() called on tensor with symbolic shape"); int64_t new_size = 1; for (auto d : dims) { @@ -766,7 +792,7 @@ void TensorImpl::Reshape(const std::vector& dims) { new_size *= d; } TORCH_CHECK( - new_size == numel_, + new_size == numel_.as_int_unchecked(), "New size and old size are not equal. You cannot use Reshape, " "but should use Resize." // TODO(jiayq): remove the following warning after pending diffs @@ -828,8 +854,11 @@ void TensorImpl::ShareExternalPointer( data_type != ScalarType::Undefined, "To share with a raw external pointer you need to pass in an " "initialized data_type(TypeMeta)."); + TORCH_CHECK( + !has_symbolic_sizes_strides_, + "ReserveSpace() called on tensor with symbolic shape"); if (!size_bytes) { - size_bytes = numel_ * data_type.itemsize(); + size_bytes = numel_.as_int_unchecked() * data_type.itemsize(); } if (storage_.unique()) { storage_.UniqueStorageShareExternalPointer(std::move(data_ptr), size_bytes); diff --git a/c10/core/TensorImpl.h b/c10/core/TensorImpl.h index a2ffa3123b083..490f92c4c02fd 100644 --- a/c10/core/TensorImpl.h +++ b/c10/core/TensorImpl.h @@ -564,6 +564,21 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { virtual c10::SymIntArrayRef sym_sizes_custom() const; + c10::SymInt sym_numel() const { + if (C10_UNLIKELY( + sizes_strides_policy_ >= + static_cast(SizesStridesPolicy::CustomSizes))) { + return sym_numel_custom(); + } + return sym_numel_default(); + } + + inline c10::SymInt sym_numel_default() const { + return numel_; + } + + virtual c10::SymInt sym_numel_custom() const; + /** * Return a reference to the strides of this tensor. This reference remains * valid as long as the tensor is live and not restrided. 
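Editor's note: with `numel_` now stored as a `SymInt`, code paths that cannot handle symbolic shapes (the Caffe2 resize and reshape helpers above, and `raw_mutable_data` further below) first assert concreteness and then narrow to a plain integer via `expect_int()` or `as_int_unchecked()`. The following toy `MaybeSymbolic` type, which is not the real `SymInt` API, sketches that narrowing guard:

```
#include <cstdint>
#include <stdexcept>

// Toy version of the "expect a concrete value" guard.
struct MaybeSymbolic {
  int64_t data;
  bool symbolic;

  // Checked narrowing: refuse to hand out a symbolic value as an int.
  int64_t expect_int() const {
    if (symbolic) {
      throw std::runtime_error("expected a concrete integer, got a symbolic one");
    }
    return data;
  }

  // Unchecked narrowing: the caller has already proved concreteness.
  int64_t as_int_unchecked() const { return data; }
};

int main() {
  MaybeSymbolic n{12, false};
  int64_t concrete = n.expect_int(); // would throw if n were symbolic
  return concrete == 12 ? 0 : 1;
}
```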
@@ -577,6 +592,23 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { return strides_default(); } + // TODO: make it non-virtual after a change to XLA + virtual c10::SymIntArrayRef sym_strides() const { + if (C10_UNLIKELY( + sizes_strides_policy_ >= + static_cast(SizesStridesPolicy::CustomStrides))) { + return sym_strides_custom(); + } + return sym_strides_default(); + } + inline c10::SymIntArrayRef sym_strides_default() const { + return c10::SymIntArrayRef( + reinterpret_cast(sizes_and_strides_.strides_data()), + sizes_and_strides_.size()); + } + + virtual c10::SymIntArrayRef sym_strides_custom() const; + /** * Return the size of a tensor at some dimension, wrapping the dimension if * necessary. @@ -746,9 +778,9 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { inline int64_t numel_default() const { #ifdef DEBUG - TORCH_INTERNAL_ASSERT(compute_numel() == numel_); + TORCH_INTERNAL_ASSERT(compute_numel() == numel_.as_int_unchecked()); #endif - return numel_; + return numel_.as_int_unchecked(); } public: @@ -1493,7 +1525,8 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { * ] for details. */ void set_allow_tensor_metadata_change(bool value) { - allow_tensor_metadata_change_ = value; + // TODO: at some point, we should kill this field completely. + allow_tensor_metadata_change_ = true; } /** @@ -1926,6 +1959,10 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { * and a new storage will be created. */ inline void* raw_mutable_data(const caffe2::TypeMeta meta) { + auto concrete_numel = numel_.expect_int(); +#ifdef DEBUG + TORCH_INTERNAL_ASSERT(compute_numel() == concrete_numel); +#endif // For 0-size tensors it's fine to return any pointer (including nullptr) if (data_type_ == meta && storage_initialized()) { return static_cast( @@ -1940,9 +1977,9 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { // We can reuse the existing buffer if the current data does not have // a special destructor and the new data doesn't have a special // constructor. - if (numel_ == 0 || + if (concrete_numel == 0 || (meta.placementNew() == nullptr && !had_special_dtor && - (storage_.nbytes() >= (numel_ * data_type_.itemsize())))) { + (storage_.nbytes() >= (concrete_numel * data_type_.itemsize())))) { TORCH_INTERNAL_ASSERT( storage_offset_ == 0); // because we just reallocated return storage_.data(); @@ -1959,18 +1996,18 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { // For types that need placement new, we will call it, as well as // making sure that when the data is freed, it calls the right // destruction procedure. - auto size = numel_; auto dtor = data_type_.placementDelete(); - auto data_ptr = allocator->allocate(numel_ * data_type_.itemsize()); + auto data_ptr = + allocator->allocate(concrete_numel * data_type_.itemsize()); storage_.set_data_ptr_noswap(PlacementDeleteContext::makeDataPtr( - std::move(data_ptr), dtor, size, storage_.device())); - data_type_.placementNew()(storage_.data(), numel_); + std::move(data_ptr), dtor, concrete_numel, storage_.device())); + data_type_.placementNew()(storage_.data(), concrete_numel); } else { // For fundamental type, new and delete is easier. 
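Editor's note: for element types with a `placementNew`/`placementDelete` pair, `raw_mutable_data` allocates raw bytes, constructs the elements in place, and records a deleter that destroys them before the storage is freed. The sketch below illustrates that allocation shape with `std::string` elements; it is a generic C++ illustration under that assumption, not the TensorImpl code path:

```
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>
#include <string>

int main() {
  const std::size_t n = 4;
  // Allocate raw, uninitialized storage for n non-trivial elements.
  void* raw = std::malloc(n * sizeof(std::string));
  auto* items = static_cast<std::string*>(raw);

  // "placementNew": construct each element in the pre-allocated storage.
  for (std::size_t i = 0; i < n; ++i) {
    new (items + i) std::string("element");
  }

  // "placementDelete": destroy in place before releasing the raw bytes,
  // mirroring what the recorded deleter does when the storage goes away.
  for (std::size_t i = 0; i < n; ++i) {
    std::destroy_at(items + i);
  }
  std::free(raw);
  return 0;
}
```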
storage_.set_data_ptr_noswap( - allocator->allocate(numel_ * data_type_.itemsize())); + allocator->allocate(concrete_numel * data_type_.itemsize())); } - storage_.set_nbytes(numel_ * data_type_.itemsize()); + storage_.set_nbytes(concrete_numel * data_type_.itemsize()); TORCH_INTERNAL_ASSERT( storage_offset_ == 0); // because we just reallocated device_opt_ = storage_.device(); @@ -2045,7 +2082,7 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { "empty_tensor_restride() called on tensor with symbolic shape") #ifdef DEBUG TORCH_INTERNAL_ASSERT( - compute_numel() == numel_, + compute_numel() == numel_.as_int_unchecked(), "If you are seeing this error, that means empty_tensor_restride was " "called before setting correct numel"); #endif @@ -2469,7 +2506,7 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target { // time, we will immediately set sizes to {0} and reset numel to 0. // (Can't do that in the default initializers, because there's no way to // spell "allocate a one-element array" for strides_). - int64_t numel_ = 1; + SymInt numel_ = c10::SymInt(1); // INVARIANT: When storage is non-null, this type meta must // agree with the type meta in storage diff --git a/c10/core/WrapDimMinimal.cpp b/c10/core/WrapDimMinimal.cpp index 2dc359fc5d4fd..920de4b9a38e6 100644 --- a/c10/core/WrapDimMinimal.cpp +++ b/c10/core/WrapDimMinimal.cpp @@ -7,10 +7,13 @@ int64_t maybe_wrap_dim_slow( int64_t dim, int64_t dim_post_expr, bool wrap_scalar) { - if (dim_post_expr <= 0) { + TORCH_CHECK_INDEX( + dim_post_expr >= 0, "Rank cannot be negative but got ", dim_post_expr); + + if (dim_post_expr == 0) { TORCH_CHECK_INDEX( wrap_scalar, - "dimension specified as ", + "Dimension specified as ", dim, " but tensor has no dimensions"); return c10::maybe_wrap_dim(dim, /*dim_post_expr=*/1, /*wrap_scalar=*/false); diff --git a/c10/core/impl/GPUTrace.cpp b/c10/core/impl/GPUTrace.cpp new file mode 100644 index 0000000000000..405ab2c9654a4 --- /dev/null +++ b/c10/core/impl/GPUTrace.cpp @@ -0,0 +1,22 @@ +#include + +#include +#include + +namespace c10 { +namespace impl { + +std::atomic GPUTrace::gpuTraceState{nullptr}; + +bool GPUTrace::haveState{false}; + +void GPUTrace::set_trace(const PyInterpreter* trace) { + static c10::once_flag flag; + c10::call_once(flag, [&]() { + gpuTraceState.store(trace, std::memory_order_release); + haveState = true; + }); +} + +} // namespace impl +} // namespace c10 diff --git a/c10/core/impl/GPUTrace.h b/c10/core/impl/GPUTrace.h new file mode 100644 index 0000000000000..377af88be034a --- /dev/null +++ b/c10/core/impl/GPUTrace.h @@ -0,0 +1,30 @@ +#pragma once + +#include + +namespace c10 { +namespace impl { + +struct C10_API GPUTrace { + // On the x86 architecture the atomic operations are lock-less. + static std::atomic gpuTraceState; + + // When PyTorch migrates to C++20, this should be changed to an atomic flag. + // Currently, the access to this variable is not synchronized, on the basis + // that it will only be flipped once and by the first interpreter that + // accesses it. + static bool haveState; + + // This function will only register the first interpreter that tries to invoke + // it. For all of the next ones it will be a no-op. 
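Editor's note: as the comment above says, `GPUTrace::set_trace` registers only the first interpreter that calls it; every later call is a no-op. A generic sketch of that first-registration-wins idiom using `std::call_once` follows (the c10 version uses `c10::call_once`; `FakeInterpreter`, `set_trace`, and `get_trace` here are illustrative stand-ins):

```
#include <atomic>
#include <cstdio>
#include <mutex>

// Stand-in for the PyInterpreter pointer being registered.
struct FakeInterpreter { const char* name; };

std::atomic<const FakeInterpreter*> g_trace{nullptr};

// First caller wins; every later call is a no-op.
void set_trace(const FakeInterpreter* interp) {
  static std::once_flag flag;
  std::call_once(flag, [interp]() {
    g_trace.store(interp, std::memory_order_release);
  });
}

const FakeInterpreter* get_trace() {
  return g_trace.load(std::memory_order_acquire);
}

int main() {
  FakeInterpreter a{"first"}, b{"second"};
  set_trace(&a);
  set_trace(&b); // ignored: registration already happened
  std::printf("registered: %s\n", get_trace()->name); // prints "first"
  return 0;
}
```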
+ static void set_trace(const PyInterpreter*); + + static const PyInterpreter* get_trace() { + if (!haveState) + return nullptr; + return gpuTraceState.load(std::memory_order_acquire); + } +}; + +} // namespace impl +} // namespace c10 diff --git a/c10/core/impl/PyInterpreter.cpp b/c10/core/impl/PyInterpreter.cpp index eec1d23e66da1..76a54663ff546 100644 --- a/c10/core/impl/PyInterpreter.cpp +++ b/c10/core/impl/PyInterpreter.cpp @@ -5,6 +5,23 @@ namespace c10 { namespace impl { +template +static void noop_trace_gpu_fn(const PyInterpreter*, Ts...) { + TORCH_INTERNAL_ASSERT( + 0, + "attempted to call a GPU trace function after corresponding interpreter died"); +} + +void GPUTraceFunctionWrapper::disarm() { + event_creation_fn_ = &noop_trace_gpu_fn; + event_deletion_fn_ = &noop_trace_gpu_fn; + event_record_fn_ = &noop_trace_gpu_fn; + event_wait_fn_ = &noop_trace_gpu_fn; + memory_allocation_fn_ = &noop_trace_gpu_fn; + memory_deallocation_fn_ = &noop_trace_gpu_fn; + stream_creation_fn_ = &noop_trace_gpu_fn; +} + static std::string noop_name_fn(const PyInterpreter*) { return ""; } @@ -76,6 +93,20 @@ static c10::Layout noop_layout_fn(const PyInterpreter*, const TensorImpl*) { "attempted to call `layout` on Tensor with nontrivial PyObject after corresponding interpreter died"); } +static c10::SymInt noop_sym_numel_fn(const PyInterpreter*, const TensorImpl*) { + TORCH_INTERNAL_ASSERT( + 0, + "attempted to call `sym_numel` on Tensor with nontrivial PyObject after corresponding interpreter died"); +} + +static c10::SymIntArrayRef noop_sym_strides_fn( + const PyInterpreter*, + const TensorImpl*) { + TORCH_INTERNAL_ASSERT( + 0, + "attempted to call `sym_strides` on Tensor with nontrivial PyObject after corresponding interpreter died"); +} + void PyInterpreter::disarm() noexcept { name_fn_ = &noop_name_fn; decref_fn_ = &noop_decref_fn; @@ -88,6 +119,15 @@ void PyInterpreter::disarm() noexcept { sizes_fn_ = &noop_sizes_fn; sym_sizes_fn_ = &noop_sym_sizes_fn; layout_fn_ = &noop_layout_fn; + sym_numel_fn_ = &noop_sym_numel_fn; + trace_gpu_functions.disarm(); + sym_strides_fn_ = &noop_sym_strides_fn; +} + +// Defined out-of-line because it needs access to the definition of TensorImpl. 
+__ubsan_ignore_function__ c10::intrusive_ptr PyInterpreter::detach( + const TensorImpl* self) const { + return (*detach_fn_)(this, self); } } // namespace impl diff --git a/c10/core/impl/PyInterpreter.h b/c10/core/impl/PyInterpreter.h index db3d9753b9dc6..3f125f6dc2be5 100644 --- a/c10/core/impl/PyInterpreter.h +++ b/c10/core/impl/PyInterpreter.h @@ -30,6 +30,46 @@ using Stack = std::vector; namespace c10 { namespace impl { +struct C10_API PyInterpreter; + +struct C10_API GPUTraceFunctionWrapper { + using event_creation_sig = void(const PyInterpreter*, uintptr_t event); + using event_deletion_sig = void(const PyInterpreter*, uintptr_t event); + using event_record_sig = + void(const PyInterpreter*, uintptr_t event, uintptr_t stream); + using event_wait_sig = + void(const PyInterpreter*, uintptr_t event, uintptr_t stream); + using memory_allocation_sig = void(const PyInterpreter*, uintptr_t pointer); + using memory_deallocation_sig = void(const PyInterpreter*, uintptr_t pointer); + using stream_creation_sig = void(const PyInterpreter*, uintptr_t stream); + + event_creation_sig* event_creation_fn_; + event_deletion_sig* event_deletion_fn_; + event_record_sig* event_record_fn_; + event_wait_sig* event_wait_fn_; + memory_allocation_sig* memory_allocation_fn_; + memory_deallocation_sig* memory_deallocation_fn_; + stream_creation_sig* stream_creation_fn_; + + GPUTraceFunctionWrapper( + event_creation_sig* event_creation_fn, + event_deletion_sig* event_deletion_fn, + event_record_sig* event_record_fn, + event_wait_sig* event_wait_fn, + memory_allocation_sig* memory_allocation_fn, + memory_deallocation_sig* memory_deallocation_fn, + stream_creation_sig* stream_creation_fn) + : event_creation_fn_(event_creation_fn), + event_deletion_fn_(event_deletion_fn), + event_record_fn_(event_record_fn), + event_wait_fn_(event_wait_fn), + memory_allocation_fn_(memory_allocation_fn), + memory_deallocation_fn_(memory_deallocation_fn), + stream_creation_fn_(stream_creation_fn) {} + + void disarm(); +}; + // Note [Python interpreter tag] // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ // Traditionally, PyTorch is layered such that our Python library @@ -136,6 +176,9 @@ struct C10_API PyInterpreter { using sym_sizes_sig = c10::SymIntArrayRef(const PyInterpreter*, const TensorImpl*); using layout_sig = c10::Layout(const PyInterpreter*, const TensorImpl*); + using sym_numel_sig = c10::SymInt(const PyInterpreter*, const TensorImpl*); + using sym_strides_sig = + c10::SymIntArrayRef(const PyInterpreter*, const TensorImpl*); PyInterpreter( name_sig* name_fn, @@ -148,7 +191,10 @@ struct C10_API PyInterpreter { strides_sig* strides, sizes_sig* sizes, sym_sizes_sig* sym_sizes, - layout_sig* layout) + layout_sig* layout, + sym_numel_sig* sym_numel, + sym_strides_sig* sym_strides, + GPUTraceFunctionWrapper trace_gpu_functions) : name_fn_(name_fn), decref_fn_(decref_fn), detach_fn_(detach), @@ -159,7 +205,10 @@ struct C10_API PyInterpreter { strides_fn_(strides), sizes_fn_(sizes), sym_sizes_fn_(sym_sizes), - layout_fn_(layout) {} + layout_fn_(layout), + sym_numel_fn_(sym_numel), + trace_gpu_functions(trace_gpu_functions), + sym_strides_fn_(sym_strides) {} name_sig* name_fn_; decref_sig* decref_fn_; @@ -172,6 +221,9 @@ struct C10_API PyInterpreter { sizes_sig* sizes_fn_; sym_sizes_sig* sym_sizes_fn_; layout_sig* layout_fn_; + sym_numel_sig* sym_numel_fn_; + GPUTraceFunctionWrapper trace_gpu_functions; + sym_strides_sig* sym_strides_fn_; // UBSAN suppression fixes: "call to function // (anonymous 
namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, @@ -194,9 +246,7 @@ struct C10_API PyInterpreter { // detach, which will also arrange for the PyObject to get copied in this // situation __ubsan_ignore_function__ c10::intrusive_ptr detach( - const TensorImpl* self) const { - return (*detach_fn_)(this, self); - } + const TensorImpl* self) const; // Invoke the Python boxed fallback dispatch to go back into Python __ubsan_ignore_function__ void dispatch( @@ -236,6 +286,53 @@ struct C10_API PyInterpreter { return (*layout_fn_)(this, self); } + __ubsan_ignore_function__ c10::SymInt sym_numel( + const TensorImpl* self) const { + return (*sym_numel_fn_)(this, self); + } + + __ubsan_ignore_function__ void trace_gpu_event_creation( + uintptr_t event) const { + return (*trace_gpu_functions.event_creation_fn_)(this, event); + } + + __ubsan_ignore_function__ void trace_gpu_event_deletion( + uintptr_t event) const { + return (*trace_gpu_functions.event_deletion_fn_)(this, event); + } + + __ubsan_ignore_function__ void trace_gpu_event_record( + uintptr_t event, + uintptr_t stream) const { + return (*trace_gpu_functions.event_record_fn_)(this, event, stream); + } + + __ubsan_ignore_function__ void trace_gpu_event_wait( + uintptr_t event, + uintptr_t stream) const { + return (*trace_gpu_functions.event_wait_fn_)(this, event, stream); + } + + __ubsan_ignore_function__ void trace_gpu_memory_allocation( + uintptr_t ptr) const { + return (*trace_gpu_functions.memory_allocation_fn_)(this, ptr); + } + + __ubsan_ignore_function__ void trace_gpu_memory_deallocation( + uintptr_t ptr) const { + return (*trace_gpu_functions.memory_deallocation_fn_)(this, ptr); + } + + __ubsan_ignore_function__ void trace_gpu_stream_creation( + uintptr_t stream) const { + return (*trace_gpu_functions.stream_creation_fn_)(this, stream); + } + + __ubsan_ignore_function__ c10::SymIntArrayRef sym_strides( + const TensorImpl* self) const { + return (*sym_strides_fn_)(this, self); + } + // Disarm this PyInterpreter, making all of its methods noops. 
// Because the function pointers are raw pointers (not atomics), // a disarm() invocation that is concurrent with active destructors diff --git a/c10/core/impl/TorchDispatchModeTLS.cpp b/c10/core/impl/TorchDispatchModeTLS.cpp new file mode 100644 index 0000000000000..fbf9504f7b5af --- /dev/null +++ b/c10/core/impl/TorchDispatchModeTLS.cpp @@ -0,0 +1,38 @@ +#include +#include +#include +#include + +namespace c10 { +namespace impl { + +thread_local std::shared_ptr torchDispatchModeState; + +void TorchDispatchModeTLS::set_state(std::shared_ptr state) { + if (state) { + c10::impl::tls_set_dispatch_key_included(DispatchKey::Python, true); + c10::impl::tls_set_dispatch_key_included( + DispatchKey::PythonTLSSnapshot, true); + } else { + TorchDispatchModeTLS::reset_state(); + } + torchDispatchModeState = std::move(state); +} + +const std::shared_ptr& TorchDispatchModeTLS::get_state() { + return torchDispatchModeState; +} + +void TorchDispatchModeTLS::reset_state() { + torchDispatchModeState.reset(); + c10::impl::tls_set_dispatch_key_included(DispatchKey::Python, false); + c10::impl::tls_set_dispatch_key_included( + DispatchKey::PythonTLSSnapshot, false); +} + +bool dispatch_mode_enabled() { + return static_cast(c10::impl::TorchDispatchModeTLS::get_state()); +} + +} // namespace impl +} // namespace c10 diff --git a/c10/core/impl/TorchDispatchModeTLS.h b/c10/core/impl/TorchDispatchModeTLS.h new file mode 100644 index 0000000000000..81aa34b11c5fc --- /dev/null +++ b/c10/core/impl/TorchDispatchModeTLS.h @@ -0,0 +1,20 @@ +#pragma once + +#include +#include +#include +#include + +namespace c10 { +namespace impl { + +struct C10_API TorchDispatchModeTLS { + static void set_state(std::shared_ptr state); + static const std::shared_ptr& get_state(); + static void reset_state(); +}; + +C10_API bool dispatch_mode_enabled(); + +} // namespace impl +} // namespace c10 diff --git a/c10/cuda/CUDACachingAllocator.cpp b/c10/cuda/CUDACachingAllocator.cpp index d60f6960e9f91..c6f91f5a59900 100644 --- a/c10/cuda/CUDACachingAllocator.cpp +++ b/c10/cuda/CUDACachingAllocator.cpp @@ -1,6 +1,7 @@ #include +#include #include #include #include @@ -180,6 +181,8 @@ struct Block { int event_count; // number of outstanding CUDA events int gc_count; // counter for prioritizing older / less useful blocks for // garbage collection + std::unique_ptr history; + History* history_last; Block( int device, @@ -279,6 +282,18 @@ struct AllocParams { cudaError_t err; }; +int trimHistoryBefore(Block* block, void* point) { + int n = 0; + while (block->history && block->history->addr < point) { + block->history = std::move(block->history->next); + ++n; + } + if (!block->history) { + block->history_last = nullptr; + } + return n; +} + // Note: cudaEventCreate when concurrently invoked from multiple threads can be // very expensive (at least on certain device/driver combinations). 
Thus, we a) // serialize event creation at a per-device level, and b) pool the events to @@ -534,18 +549,30 @@ class DeviceCachingAllocator { // Maps a capturing stream to its assigned private pool, // in case we want multiple captures to share the same pool ska::flat_hash_map capture_to_pool_map; + std::atomic context_recorder_; public: DeviceCachingAllocator() : large_blocks(BlockComparator, /*is_small=*/false), small_blocks(BlockComparator, /*is_small=*/true) { stats.max_split_size = CachingAllocatorConfig::max_split_size(); + context_recorder_.store(nullptr); + } + + void setContextRecorder(CreateContextFn c) { + context_recorder_.store(c); } // All public methods (except the above) acquire the allocator mutex. // Thus, do not call a public method from another public method. - Block* malloc(int device, size_t size, cudaStream_t stream) { + Block* malloc(int device, size_t orig_size, cudaStream_t stream) { + // done outside the lock because we don't know what locks the recorder needs + // to have... + CreateContextFn context_recorder = context_recorder_.load(); + std::unique_ptr context = + context_recorder ? context_recorder() : nullptr; + std::unique_lock lock(mutex); if (C10_LIKELY(captures_underway == 0)) { @@ -562,7 +589,7 @@ class DeviceCachingAllocator { process_events(); } - size = round_size(size); + size_t size = round_size(orig_size); auto& pool = get_pool(size, stream); const size_t alloc_size = get_allocation_size(size); AllocParams params(device, size, stream, &pool, alloc_size, stats); @@ -637,7 +664,7 @@ class DeviceCachingAllocator { // possible "cached" memory to the driver. The only remaining "cached" // memory is split from a larger block that is partially in-use. TORCH_CHECK_WITH( - CUDAOutOfMemoryError, + OutOfMemoryError, false, "CUDA out of memory. Tried to allocate ", format_size(alloc_size), @@ -685,6 +712,10 @@ class DeviceCachingAllocator { bool inserted = pool.blocks.insert(remaining).second; TORCH_INTERNAL_ASSERT_DEBUG_ONLY(inserted); + if (context) { + trimHistoryBefore(remaining, (char*)block->ptr + size); + } + if (already_split) { // An already-split inactive block is being shrunk by size bytes. 
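[Editor's aside, not part of the patch: the context_recorder_ / History plumbing in the malloc path above is driven by the public setContextRecorder(), Context, CreateContextFn and History declarations added to CUDACachingAllocator.h later in this diff. A hedged sketch of how a memory-profiling tool would hook in; StepContext and the helper names are hypothetical.]

```
#include <c10/cuda/CUDACachingAllocator.h>
#include <memory>
#include <string>

namespace alloc = c10::cuda::CUDACachingAllocator;

// Watcher-defined payload; the allocator only stores it and frees it together
// with the History entry it is attached to.
struct StepContext : public alloc::Context {
  std::string tag{"train-step"};
};

// CreateContextFn: invoked for every malloc (outside the allocator lock) once
// a recorder is installed; the result ends up on the block's History chain and
// is exposed as BlockInfo::history in allocator snapshots.
std::unique_ptr<alloc::Context> make_step_context() {
  return std::make_unique<StepContext>();
}

void install_recorder_for_current_device() {
  // Applies to the allocator of the currently selected CUDA device.
  alloc::setContextRecorder(&make_step_context);
}
```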
update_stat_array( @@ -697,6 +728,7 @@ class DeviceCachingAllocator { update_stat(stats.inactive_split[stat_type], 1); }); } + } else if (already_split) { // An already-split block is becoming active for_each_selected_stat_type(params.stat_types, [&](size_t stat_type) { @@ -706,6 +738,17 @@ class DeviceCachingAllocator { } block->allocated = true; + if (context) { + trimHistoryBefore(block, (char*)block->ptr + size); + block->history = std::make_unique(History{ + block->ptr, + orig_size, + std::move(context), + std::move(block->history)}); + if (!block->history_last) { + block->history_last = block->history.get(); + } + } bool inserted = active_blocks.insert(block).second; TORCH_INTERNAL_ASSERT_DEBUG_ONLY(inserted); @@ -894,6 +937,7 @@ class DeviceCachingAllocator { SegmentInfo& segment_info = result.back(); segment_info.device = head_block->device; segment_info.address = reinterpret_cast(head_block->ptr); + segment_info.stream = head_block->stream; segment_info.is_large = (!head_block->pool->is_small); const Block* block = head_block; @@ -913,7 +957,7 @@ class DeviceCachingAllocator { if (block_info.active) { segment_info.active_size += block_info.size; } - + block_info.history = block->history.get(); block = block->next; } } @@ -1107,19 +1151,35 @@ class DeviceCachingAllocator { AT_ASSERT(dst->is_split() && src->is_split()); - if (dst->prev == src) { + if (dst->prev == src) { // [src dst] dst->ptr = src->ptr; dst->prev = src->prev; if (dst->prev) { dst->prev->next = dst; } - } else { + if (!dst->history) { + dst->history = std::move(src->history); + dst->history_last = src->history_last; + } else if (src->history) { + src->history_last->next = std::move(dst->history); + dst->history = std::move(src->history); + } + src->history_last = nullptr; + } else { // [dest src] dst->next = src->next; if (dst->next) { dst->next->prev = dst; } - } + if (!dst->history) { + dst->history = std::move(src->history); + dst->history_last = src->history_last; + } else if (src->history) { + dst->history_last->next = std::move(src->history); + dst->history_last = src->history_last; + } + src->history_last = nullptr; + } const size_t subsumed_size = src->size; dst->size += subsumed_size; auto erased = pool.blocks.erase(src); @@ -1345,7 +1405,14 @@ class DeviceCachingAllocator { std::numeric_limits::max()) return false; BlockPool& pool = *p.pool; - Block key = p.search_key; + + // because of std::unique_ptr, block cannot be trivially copied + Block key( + p.search_key.device, + p.search_key.stream, + p.search_key.size, + p.search_key.pool, + p.search_key.ptr); key.size = (key.size < CachingAllocatorConfig::max_split_size()) ? 
CachingAllocatorConfig::max_split_size() : key.size; @@ -1614,6 +1681,10 @@ class THCCachingAllocator { Block* block = device_allocator[device]->malloc(device, size, stream); add_allocated_block(block); *devPtr = (void*)block->ptr; + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_memory_allocation(reinterpret_cast(*devPtr)); + } } void free(void* ptr) { @@ -1624,6 +1695,11 @@ class THCCachingAllocator { if (!block) { TORCH_CHECK(false, "invalid device pointer: ", ptr); } + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_memory_deallocation( + reinterpret_cast(block->ptr)); + } device_allocator[block->device]->free(block); } @@ -1646,6 +1722,12 @@ class THCCachingAllocator { device_allocator[device]->setMemoryFraction(fraction); } + void setContextRecorder(CreateContextFn recorder) { + int device; + C10_CUDA_CHECK(cudaGetDevice(&device)); + device_allocator[device]->setContextRecorder(std::move(recorder)); + } + void emptyCache() { for (auto& da : device_allocator) da->emptyCache(); @@ -1703,6 +1785,10 @@ bool forceUncachedAllocator() { } static void uncached_delete(void* ptr) { + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_memory_deallocation(reinterpret_cast(ptr)); + } C10_CUDA_CHECK(cudaFree(ptr)); } @@ -1713,7 +1799,7 @@ struct CudaCachingAllocator : public Allocator { DataPtr allocate(size_t size) const override { constexpr size_t one_exa_bytes = 1152921504606846976ULL; TORCH_CHECK_WITH( - CUDAOutOfMemoryError, + OutOfMemoryError, size < one_exa_bytes, "CUDA out of memory. Tried to allocate more than 1EB memory."); int device; @@ -1723,6 +1809,10 @@ struct CudaCachingAllocator : public Allocator { // Deliberately don't use cudaMallocMaybeCapturing here, to force an error // if someone tries to use forceUncachedAllocator while capturing. C10_CUDA_CHECK(cudaMalloc(&r, size)); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_memory_allocation(reinterpret_cast(r)); + } return {r, r, &uncached_delete, Device(DeviceType::CUDA, device)}; } if (size != 0) { @@ -1754,6 +1844,10 @@ void setMemoryFraction(double fraction, int device) { caching_allocator.setMemoryFraction(fraction, device); } +void setContextRecorder(CreateContextFn recorder) { + caching_allocator.setContextRecorder(std::move(recorder)); +} + void emptyCache(void) { caching_allocator.emptyCache(); } diff --git a/c10/cuda/CUDACachingAllocator.h b/c10/cuda/CUDACachingAllocator.h index 9b1a6ecf15903..0fd23f4e61d58 100644 --- a/c10/cuda/CUDACachingAllocator.h +++ b/c10/cuda/CUDACachingAllocator.h @@ -11,10 +11,6 @@ namespace c10 { -class C10_CUDA_API CUDAOutOfMemoryError : public c10::Error { - using Error::Error; -}; - // Caching allocator will execute every registered callback if it unable to find // block inside of already allocated area. 
class C10_CUDA_API FreeMemoryCallback { @@ -98,6 +94,20 @@ struct DeviceStats { int64_t max_split_size = 0; }; +struct Context { + virtual ~Context() {} +}; + +typedef std::unique_ptr (*CreateContextFn)(void); + +struct History { + void* addr; + size_t real_size; // unrounded, actually requested size + std::unique_ptr context; // per-watcher context + std::unique_ptr next; // when blocks are merged we keep records of + // what used to be in the block +}; + // Struct containing info of an allocation block (i.e. a fractional part of a // cudaMalloc).. struct BlockInfo { @@ -105,6 +115,8 @@ struct BlockInfo { int32_t gc_counter = 0; bool allocated = false; bool active = false; + History* history = + nullptr; // borrowed reference because it is owned by the allocator }; // Struct containing info of a memory segment (i.e. one contiguous cudaMalloc). @@ -114,6 +126,7 @@ struct SegmentInfo { int64_t total_size = 0; int64_t allocated_size = 0; int64_t active_size = 0; + cudaStream_t stream = 0; bool is_large = false; std::vector blocks; }; @@ -147,6 +160,8 @@ C10_CUDA_API void notifyCaptureDestroy(int device, MempoolId_t mempool_id); C10_CUDA_API std::mutex* getFreeMutex(); +C10_CUDA_API void setContextRecorder(CreateContextFn recorder); + C10_CUDA_API std::shared_ptr getIpcDevPtr(std::string handle); } // namespace CUDACachingAllocator diff --git a/c10/cuda/CUDAStream.cpp b/c10/cuda/CUDAStream.cpp index b7fc04b50a8c2..e80026cf81b85 100644 --- a/c10/cuda/CUDAStream.cpp +++ b/c10/cuda/CUDAStream.cpp @@ -1,3 +1,4 @@ +#include #include #include #include @@ -165,6 +166,14 @@ static void initDeviceStreamState(DeviceIndex device_index) { &lowpri_stream, kDefaultFlags, kLowPriority)); C10_CUDA_CHECK(cudaStreamCreateWithPriority( &hipri_stream, kDefaultFlags, kHighPriority)); + + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_stream_creation( + reinterpret_cast(lowpri_stream)); + interp->trace_gpu_stream_creation( + reinterpret_cast(hipri_stream)); + } } low_priority_counters[device_index] = 0; diff --git a/c10/cuda/impl/CUDAGuardImpl.h b/c10/cuda/impl/CUDAGuardImpl.h index 583feeec26000..52a9de8ce1cb0 100644 --- a/c10/cuda/impl/CUDAGuardImpl.h +++ b/c10/cuda/impl/CUDAGuardImpl.h @@ -2,6 +2,7 @@ #include #include +#include #include #include @@ -100,6 +101,10 @@ struct CUDAGuardImpl final : public c10::impl::DeviceGuardImplInterface { } C10_CUDA_CHECK(cudaEventCreateWithFlags(cuda_event, cuda_flag)); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_creation(reinterpret_cast(cuda_event)); + } } void destroyEvent(void* event, const DeviceIndex device_index) @@ -110,6 +115,10 @@ struct CUDAGuardImpl final : public c10::impl::DeviceGuardImplInterface { int orig_device; C10_CUDA_CHECK_WARN(cudaGetDevice(&orig_device)); C10_CUDA_CHECK_WARN(cudaSetDevice(device_index)); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_deletion(reinterpret_cast(cuda_event)); + } C10_CUDA_CHECK_WARN(cudaEventDestroy(cuda_event)); C10_CUDA_CHECK_WARN(cudaSetDevice(orig_device)); } @@ -140,6 +149,12 @@ struct CUDAGuardImpl final : public c10::impl::DeviceGuardImplInterface { C10_CUDA_CHECK(cudaEventRecord(cuda_event, cuda_stream)); // Makes the void* point to the (possibly just allocated) CUDA event *event = cuda_event; + const c10::impl::PyInterpreter* interp = 
c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_record( + reinterpret_cast(cuda_event), + reinterpret_cast(cuda_stream.stream())); + } // Resets device setDevice(orig_device); @@ -156,6 +171,12 @@ struct CUDAGuardImpl final : public c10::impl::DeviceGuardImplInterface { cuda_stream, cuda_event, /*flags (must be zero)=*/0)); + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + interp->trace_gpu_event_wait( + reinterpret_cast(cuda_event), + reinterpret_cast(cuda_stream.stream())); + } setDevice(orig_device); } diff --git a/c10/macros/Macros.h b/c10/macros/Macros.h index 84a5045d648c7..b97c16421028a 100644 --- a/c10/macros/Macros.h +++ b/c10/macros/Macros.h @@ -114,22 +114,17 @@ // - MSVC 19.14: https://godbolt.org/z/Dzd7gn (requires /std:c++latest) // - Clang 8.0.0: https://godbolt.org/z/3PYL4Z (always advertises support) // - gcc 8.3: https://godbolt.org/z/4tLMQS (always advertises support) -#define C10_NODISCARD -#if defined(__has_cpp_attribute) -#if __has_cpp_attribute(nodiscard) -#undef C10_NODISCARD +#if C10_HAS_CPP_ATTRIBUTE(nodiscard) #define C10_NODISCARD [[nodiscard]] -#endif // Workaround for llvm.org/PR23435, since clang 3.6 and below emit a spurious // error when __has_cpp_attribute is given a scoped attribute in C mode. -#elif __cplusplus && defined(__has_cpp_attribute) -#if __has_cpp_attribute(clang::warn_unused_result) +#elif __cplusplus && C10_HAS_CPP_ATTRIBUTE(clang::warn_unused_result) // TODO: It's possible this is still triggering // https://github.com/pytorch/pytorch/issues/13118 on Windows; if it is, better // fix it. -#undef C10_NODISCARD #define C10_NODISCARD [[clang::warn_unused_result]] -#endif +#else +#define C10_NODISCARD #endif // suppress an unused variable. @@ -243,8 +238,7 @@ using namespace c10::hip; #define C10_FALLTHROUGH #endif -#include -#include +#include #ifdef __HIPCC__ // Unlike CUDA, HIP requires a HIP header to be included for __host__ to work. @@ -332,7 +326,9 @@ constexpr uint32_t CUDA_THREADS_PER_BLOCK_FALLBACK = 256; // CUDA_KERNEL_ASSERT checks the assertion // even when NDEBUG is defined. This is useful for important assertions in CUDA // code that would otherwise be suppressed when building Release. -#if defined(__ANDROID__) || defined(__APPLE__) || defined(USE_ROCM) +#if defined(__ANDROID__) || defined(__APPLE__) || \ + (defined(USE_ROCM) && ROCM_VERSION < 40100) || \ + (defined(USE_ROCM) && defined(ROCM_DISABLE_GPU_ASSERTS)) // Those platforms do not support assert() #define CUDA_KERNEL_ASSERT(cond) #elif defined(_MSC_VER) @@ -361,22 +357,32 @@ extern SYCL_EXTERNAL void __assert_fail( const char* func); #else // __SYCL_DEVICE_ONLY__ #if (defined(__CUDA_ARCH__) && !(defined(__clang__) && defined(__CUDA__))) +// CUDA supports __assert_fail function which are common for both device +// and host side code. __host__ __device__ -#endif // __CUDA_ARCH__ +#endif + + // This forward declaration matching the declaration of __assert_fail + // exactly how it is in glibc in case parts of the program are compiled with + // different NDEBUG settings. Otherwise we might get 'ambiguous declaration' + // error. Note: On ROCm - this declaration serves for host side compilation. void __assert_fail( const char* assertion, const char* file, unsigned int line, - const char* function) throw() -// We match the declaration of __assert_fail exactly how it is in glibc in case -// parts of the program are compiled with different NDEBUG settings. 
Otherwise -// we might get 'ambiguous declaration' error. -#ifdef __GNUC__ - __attribute__((__noreturn__)) -#endif - ; -#endif + const char* function) throw() __attribute__((__noreturn__)); + +#if (defined(__HIP_ARCH__) || defined(__HIP__)) && \ + !defined(ROCM_DISABLE_GPU_ASSERTS) +// ROCm supports __assert_fail only as a device side function. +__device__ __attribute__((noinline)) __attribute__((weak)) void __assert_fail( + const char* assertion, + const char* file, + unsigned int line, + const char* function); +#endif // defined(__HIP_ARCH__) || defined(__HIP__) +#endif // __SYCL_DEVICE_ONLY__ } #endif // NDEBUG #define CUDA_KERNEL_ASSERT(cond) \ diff --git a/c10/test/core/SymInt_test.cpp b/c10/test/core/SymInt_test.cpp index 8892cce015daa..a57e7c706486d 100644 --- a/c10/test/core/SymInt_test.cpp +++ b/c10/test/core/SymInt_test.cpp @@ -4,7 +4,7 @@ #include using namespace c10; - +#ifndef C10_MOBILE void check(int64_t value) { EXPECT_TRUE(SymInt::check_range(value)); const auto i = SymInt(value); @@ -29,3 +29,4 @@ TEST(SymIntTest, AddNode) { TEST(SymIntTest, CheckRange) { EXPECT_FALSE(SymInt::check_range(INT64_MIN)); } +#endif diff --git a/c10/util/Exception.h b/c10/util/Exception.h index 327e4cbfabd11..a869038ea444f 100644 --- a/c10/util/Exception.h +++ b/c10/util/Exception.h @@ -235,6 +235,10 @@ class C10_API LinAlgError : public Error { using Error::Error; }; +class C10_API OutOfMemoryError : public Error { + using Error::Error; +}; + // A utility function to return an exception std::string by prepending its // exception type before its what() content C10_API std::string GetExceptionString(const std::exception& e); diff --git a/c10/util/IdWrapper.h b/c10/util/IdWrapper.h index a22a60cb9fc3d..59b5088c270f8 100644 --- a/c10/util/IdWrapper.h +++ b/c10/util/IdWrapper.h @@ -1,6 +1,7 @@ #pragma once #include +#include #include #include diff --git a/c10/util/SmallVector.cpp b/c10/util/SmallVector.cpp index f70c982c83150..d57f4d97b999e 100644 --- a/c10/util/SmallVector.cpp +++ b/c10/util/SmallVector.cpp @@ -17,6 +17,7 @@ #include #include #include +#include using namespace c10; // Check that no bytes are wasted and everything is well-aligned. 
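[Editor's illustration, not part of the patch: with CUDAOutOfMemoryError removed, both the caching and the uncached CUDA allocator now report allocation failures as the generic c10::OutOfMemoryError added to c10/util/Exception.h above, so C++ callers can special-case OOM without a CUDA-specific error type. A sketch under that assumption; alloc_with_fallback is a hypothetical helper.]

```
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/util/Exception.h>
#include <cstddef>

// Hypothetical helper: try a large device allocation on the current device;
// on OOM, return cached segments to the driver and retry once at half size.
void* alloc_with_fallback(std::size_t nbytes) {
  namespace alloc = c10::cuda::CUDACachingAllocator;
  try {
    return alloc::raw_alloc(nbytes);
  } catch (const c10::OutOfMemoryError&) {
    alloc::emptyCache(); // give cached blocks back before retrying
  }
  try {
    return alloc::raw_alloc(nbytes / 2);
  } catch (const c10::OutOfMemoryError&) {
    return nullptr; // caller decides how to degrade
  }
}
```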
diff --git a/c10/util/SmallVector.h b/c10/util/SmallVector.h index 1fcc4a1a8f43a..e4672d666a931 100644 --- a/c10/util/SmallVector.h +++ b/c10/util/SmallVector.h @@ -35,6 +35,7 @@ #include #include #include +#include #include #include diff --git a/c10/util/hash.h b/c10/util/hash.h index d4bb42da21c96..9d771e401ed46 100644 --- a/c10/util/hash.h +++ b/c10/util/hash.h @@ -304,6 +304,14 @@ struct hash> { } }; +template +struct hash> { + size_t operator()(const std::pair& pair) const { + std::tuple tuple = std::make_tuple(pair.first, pair.second); + return _hash_detail::simple_get_hash(tuple); + } +}; + template struct hash> { size_t operator()(c10::ArrayRef v) const { diff --git a/c10/util/logging_is_google_glog.h b/c10/util/logging_is_google_glog.h index b5860d8c0c9f4..e5470d22cecd3 100644 --- a/c10/util/logging_is_google_glog.h +++ b/c10/util/logging_is_google_glog.h @@ -50,13 +50,14 @@ INSTANTIATE_FOR_CONTAINER(set) #include // Additional macros on top of glog -#ifndef NDEBUG #define TORCH_CHECK_EQ(val1, val2) CHECK_EQ(val1, val2) #define TORCH_CHECK_NE(val1, val2) CHECK_NE(val1, val2) #define TORCH_CHECK_LE(val1, val2) CHECK_LE(val1, val2) #define TORCH_CHECK_LT(val1, val2) CHECK_LT(val1, val2) #define TORCH_CHECK_GE(val1, val2) CHECK_GE(val1, val2) #define TORCH_CHECK_GT(val1, val2) CHECK_GT(val1, val2) + +#ifndef NDEBUG #define TORCH_DCHECK_EQ(val1, val2) DCHECK_EQ(val1, val2) #define TORCH_DCHECK_NE(val1, val2) DCHECK_NE(val1, val2) #define TORCH_DCHECK_LE(val1, val2) DCHECK_LE(val1, val2) @@ -65,24 +66,6 @@ INSTANTIATE_FOR_CONTAINER(set) #define TORCH_DCHECK_GT(val1, val2) DCHECK_GT(val1, val2) #else // !NDEBUG // These versions generate no code in optimized mode. -#define TORCH_CHECK_EQ(val1, val2) \ - while (false) \ - CHECK_EQ(val1, val2) -#define TORCH_CHECK_NE(val1, val2) \ - while (false) \ - CHECK_NE(val1, val2) -#define TORCH_CHECK_LE(val1, val2) \ - while (false) \ - CHECK_LE(val1, val2) -#define TORCH_CHECK_LT(val1, val2) \ - while (false) \ - CHECK_LT(val1, val2) -#define TORCH_CHECK_GE(val1, val2) \ - while (false) \ - CHECK_GE(val1, val2) -#define TORCH_CHECK_GT(val1, val2) \ - while (false) \ - CHECK_GT(val1, val2) #define TORCH_DCHECK_EQ(val1, val2) \ while (false) \ DCHECK_EQ(val1, val2) diff --git a/c10/util/strides.h b/c10/util/strides.h index 40315a625c61f..8a7f7f6301f67 100644 --- a/c10/util/strides.h +++ b/c10/util/strides.h @@ -9,16 +9,12 @@ static inline DimVector contiguous_strides(const IntArrayRef sizes) { using Int = IntArrayRef::value_type; const Int dims = static_cast(sizes.size()); - DimVector strides; + // With this intialisation we get the case dim == 0 or 1 right + DimVector strides(dims, 1); - if (dims > 0) { - strides.assign(dims, 0); - // Start by populating the last dimension: its strides is always 1. - strides[dims - 1] = 1; - for (auto i = dims - 2; i >= 0; --i) { - // Strides can't be 0 even if sizes are 0. - strides[i] = strides[i + 1] * std::max(sizes[i + 1], Int{1}); - } + for (auto i = dims - 2; i >= 0; --i) { + // Strides can't be 0 even if sizes are 0. 
+ strides[i] = strides[i + 1] * std::max(sizes[i + 1], Int{1}); } return strides; diff --git a/c10/util/variant.h b/c10/util/variant.h index 564dd3b55d018..53001afea28c2 100644 --- a/c10/util/variant.h +++ b/c10/util/variant.h @@ -2253,7 +2253,6 @@ class impl : public copy_assignment> { public: C10_MPARK_INHERITING_CTOR(impl, super) - impl& operator=(const impl& other) = default; template inline void assign(Arg&& arg) { diff --git a/caffe2/CMakeLists.txt b/caffe2/CMakeLists.txt index 65cdd576d9c28..a904898040123 100644 --- a/caffe2/CMakeLists.txt +++ b/caffe2/CMakeLists.txt @@ -63,7 +63,7 @@ if(INTERN_BUILD_ATEN_OPS) set(CMAKE_POSITION_INDEPENDENT_CODE ${__caffe2_CMAKE_POSITION_INDEPENDENT_CODE}) # Generate the headers wrapped by our operator - file(GLOB_RECURSE all_python "${PROJECT_SOURCE_DIR}/torchgen/*.py") + file(GLOB_RECURSE torchgen_python "${PROJECT_SOURCE_DIR}/torchgen/*.py") add_custom_command(OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/contrib/aten/aten_op.h COMMAND "${PYTHON_EXECUTABLE}" ${CMAKE_CURRENT_SOURCE_DIR}/contrib/aten/gen_op.py @@ -72,7 +72,7 @@ if(INTERN_BUILD_ATEN_OPS) --yaml_dir=${CMAKE_CURRENT_BINARY_DIR}/../aten/src/ATen --install_dir=${CMAKE_CURRENT_BINARY_DIR}/contrib/aten DEPENDS - ${all_python} + ${torchgen_python} ${CMAKE_BINARY_DIR}/aten/src/ATen/Declarations.yaml ${CMAKE_CURRENT_SOURCE_DIR}/contrib/aten/gen_op.py ${CMAKE_CURRENT_SOURCE_DIR}/contrib/aten/aten_op_template.h) @@ -425,6 +425,9 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE) list(APPEND GEN_PER_OPERATOR_FLAG "--per_operator_headers") endif() + file(GLOB_RECURSE autograd_python "${TOOLS_PATH}/autograd/*.py") + file(GLOB_RECURSE autograd_yaml "${TOOLS_PATH}/autograd/*.yaml") + file(GLOB_RECURSE autograd_templates "${TOOLS_PATH}/autograd/templates/*") add_custom_command( OUTPUT ${TORCH_GENERATED_CODE} @@ -438,48 +441,20 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE) --gen_lazy_ts_backend ${GEN_PER_OPERATOR_FLAG} DEPENDS - "${TORCH_ROOT}/aten/src/ATen/native/native_functions.yaml" - "${TORCH_ROOT}/aten/src/ATen/native/tags.yaml" - "${TORCH_ROOT}/aten/src/ATen/native/ts_native_functions.yaml" - "${TORCH_ROOT}/torch/csrc/lazy/core/shape_inference.h" - "${TORCH_ROOT}/torch/csrc/lazy/ts_backend/ts_native_functions.cpp" - "${TORCH_ROOT}/aten/src/ATen/templates/DispatchKeyNativeFunctions.h" - "${TORCH_ROOT}/aten/src/ATen/templates/DispatchKeyNativeFunctions.cpp" - "${TORCH_ROOT}/aten/src/ATen/templates/LazyIr.h" - "${TORCH_ROOT}/aten/src/ATen/templates/LazyNonNativeIr.h" - "${TORCH_ROOT}/aten/src/ATen/templates/RegisterDispatchKey.cpp" - "${TOOLS_PATH}/autograd/templates/VariableType.h" - "${TOOLS_PATH}/autograd/templates/VariableType.cpp" - "${TOOLS_PATH}/autograd/templates/ADInplaceOrViewType.cpp" - "${TOOLS_PATH}/autograd/templates/TraceType.cpp" - "${TOOLS_PATH}/autograd/templates/Functions.h" - "${TOOLS_PATH}/autograd/templates/Functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_functions.h" - "${TOOLS_PATH}/autograd/templates/python_functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_variable_methods.cpp" - "${TOOLS_PATH}/autograd/templates/python_torch_functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_nn_functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_fft_functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_linalg_functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_sparse_functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_special_functions.cpp" - "${TOOLS_PATH}/autograd/templates/python_return_types.cpp" - 
"${TOOLS_PATH}/autograd/templates/python_enum_tag.cpp" - "${TOOLS_PATH}/autograd/templates/variable_factories.h" - "${TOOLS_PATH}/autograd/templates/annotated_fn_args.py.in" - "${TOOLS_PATH}/autograd/deprecated.yaml" - "${TOOLS_PATH}/autograd/derivatives.yaml" - "${TOOLS_PATH}/autograd/gen_autograd_functions.py" - "${TOOLS_PATH}/autograd/gen_autograd.py" - "${TOOLS_PATH}/autograd/gen_python_functions.py" - "${TOOLS_PATH}/autograd/gen_variable_factories.py" - "${TOOLS_PATH}/autograd/gen_variable_type.py" - "${TOOLS_PATH}/autograd/gen_inplace_or_view_type.py" - "${TOOLS_PATH}/autograd/load_derivatives.py" - "${TORCH_ROOT}/torchgen/gen_backend_stubs.py" - "${TORCH_ROOT}/torchgen/gen_lazy_tensor.py" - "${TORCH_ROOT}/torchgen/api/lazy.py" - "${TORCH_ROOT}/torchgen/dest/lazy_ir.py" + "${TORCH_ROOT}/aten/src/ATen/native/native_functions.yaml" + "${TORCH_ROOT}/aten/src/ATen/native/tags.yaml" + "${TORCH_ROOT}/aten/src/ATen/native/ts_native_functions.yaml" + "${TORCH_ROOT}/torch/csrc/lazy/core/shape_inference.h" + "${TORCH_ROOT}/torch/csrc/lazy/ts_backend/ts_native_functions.cpp" + "${TORCH_ROOT}/aten/src/ATen/templates/DispatchKeyNativeFunctions.h" + "${TORCH_ROOT}/aten/src/ATen/templates/DispatchKeyNativeFunctions.cpp" + "${TORCH_ROOT}/aten/src/ATen/templates/LazyIr.h" + "${TORCH_ROOT}/aten/src/ATen/templates/LazyNonNativeIr.h" + "${TORCH_ROOT}/aten/src/ATen/templates/RegisterDispatchKey.cpp" + ${autograd_python} + ${autograd_yaml} + ${autograd_templates} + ${torchgen_python} WORKING_DIRECTORY "${TORCH_ROOT}") @@ -553,7 +528,6 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE) ${TORCH_SRC_DIR}/csrc/jit/backends/coreml/objc/PTMCoreMLExecutor.mm ${TORCH_SRC_DIR}/csrc/jit/backends/coreml/objc/PTMCoreMLCompiler.mm ${TORCH_SRC_DIR}/csrc/jit/backends/coreml/objc/PTMCoreMLFeatureProvider.mm - ${TORCH_SRC_DIR}/csrc/jit/backends/coreml/observer/PTMCoreMLObserver.mm ) set_source_files_properties(${TORCH_SRC_DIR}/csrc/jit/backends/coreml/objc/PTMCoreMLBackend.mm PROPERTIES COMPILE_FLAGS "-fno-objc-arc") include_directories(${TORCH_ROOT}/third_party/nlohmann/single_include) @@ -918,11 +892,6 @@ if(HAVE_SOVERSION) VERSION ${TORCH_VERSION} SOVERSION ${TORCH_SOVERSION}) endif() -if(USE_UCC) - target_link_libraries(torch_cpu PRIVATE __caffe2_ucc) - target_compile_definitions(torch_cpu PRIVATE USE_UCC) -endif() - if(USE_ROCM) filter_list(__caffe2_hip_srcs_cpp Caffe2_HIP_SRCS "\\.(cu|hip)$") set_source_files_properties(${__caffe2_hip_srcs_cpp} PROPERTIES HIP_SOURCE_PROPERTY_FORMAT 1) @@ -1070,23 +1039,36 @@ endif() # Codegen selected_mobile_ops.h for template selective build if(BUILD_LITE_INTERPRETER AND SELECTED_OP_LIST) message("running gen_selected_mobile_ops_header for: '${SELECTED_OP_LIST}'") + file(GLOB lite_interpreter_python "${TOOLS_PATH}/lite_interpreter/*.py") if(${TRACING_BASED}) + file(GLOB code_analyzer_python "${TOOLS_PATH}/code_analyzer/*.py") add_custom_command( OUTPUT ${CMAKE_BINARY_DIR}/aten/src/ATen/selected_mobile_ops.h COMMAND - "${PYTHON_EXECUTABLE}" - -m tools.code_analyzer.gen_oplist - --model_file_list_path "${SELECTED_OP_LIST}" - --output_dir "${CMAKE_BINARY_DIR}/aten/src/ATen" + "${PYTHON_EXECUTABLE}" + -m tools.code_analyzer.gen_oplist + --model_file_list_path "${SELECTED_OP_LIST}" + --output_dir "${CMAKE_BINARY_DIR}/aten/src/ATen" + DEPENDS + ${torchgen_python} + ${lite_interpreter_python} + ${code_analyzer_python} + "${SELECTED_OP_LIST}" + "${TORCH_ROOT}/aten/src/ATen/native/native_functions.yaml" WORKING_DIRECTORY "${TORCH_ROOT}") else() add_custom_command( OUTPUT 
${CMAKE_BINARY_DIR}/aten/src/ATen/selected_mobile_ops.h COMMAND - "${PYTHON_EXECUTABLE}" - -m tools.lite_interpreter.gen_selected_mobile_ops_header - --yaml_file_path "${SELECTED_OP_LIST}" - --output_file_path "${CMAKE_BINARY_DIR}/aten/src/ATen" + "${PYTHON_EXECUTABLE}" + -m tools.lite_interpreter.gen_selected_mobile_ops_header + --yaml_file_path "${SELECTED_OP_LIST}" + --output_file_path "${CMAKE_BINARY_DIR}/aten/src/ATen" + DEPENDS + ${torchgen_python} + ${lite_interpreter_python} + "${SELECTED_OP_LIST}" + "${TORCH_ROOT}/aten/src/ATen/native/native_functions.yaml" WORKING_DIRECTORY "${TORCH_ROOT}") endif() diff --git a/caffe2/core/tensor.h b/caffe2/core/tensor.h index 4c5be742d0cf7..f7b3d90fe63a7 100644 --- a/caffe2/core/tensor.h +++ b/caffe2/core/tensor.h @@ -439,6 +439,14 @@ class TORCH_API Tensor final { return impl_->sym_sizes(); } + inline c10::SymInt sym_numel() const { + return impl_->sym_numel(); + } + + inline c10::SymIntArrayRef sym_strides() const { + return impl_->sym_strides(); + } + inline int64_t size_from_dim(int k) const { return size_from_dim_(k, impl_->sizes()); } diff --git a/caffe2/quantization/server/dnnlowp.h b/caffe2/quantization/server/dnnlowp.h index 2f68d156af108..c71ac8dbef6e1 100644 --- a/caffe2/quantization/server/dnnlowp.h +++ b/caffe2/quantization/server/dnnlowp.h @@ -6,7 +6,9 @@ #include #include +#ifdef __x86_64__ #include +#endif #include diff --git a/caffe2/quantization/server/fully_connected_fake_lowp_op.h b/caffe2/quantization/server/fully_connected_fake_lowp_op.h index 6cbfc900e9613..cee1c26498fb9 100644 --- a/caffe2/quantization/server/fully_connected_fake_lowp_op.h +++ b/caffe2/quantization/server/fully_connected_fake_lowp_op.h @@ -16,7 +16,9 @@ #pragma once +#ifdef __x86_64__ #include +#endif #include "caffe2/core/context.h" #include "caffe2/core/operator.h" #include "caffe2/utils/conversions.h" diff --git a/caffe2/serialize/inline_container.cc b/caffe2/serialize/inline_container.cc index 9847bc132264d..9d3cc332ae96e 100644 --- a/caffe2/serialize/inline_container.cc +++ b/caffe2/serialize/inline_container.cc @@ -142,7 +142,13 @@ void PyTorchStreamReader::init() { std::tie(version_ptr, version_size) = getRecord("version"); } std::string version(static_cast(version_ptr.get()), version_size); - version_ = caffe2::stoull(version); + try { + version_ = caffe2::stoull(version); + } catch (const std::invalid_argument &e) { + CAFFE_THROW("Couldn't parse the version ", + version, + " as Long Long."); + } // NOLINTNEXTLINE(clang-diagnostic-sign-compare) if (version_ < kMinSupportedFileFormatVersion) { CAFFE_THROW( diff --git a/caffe2/serialize/inline_container.h b/caffe2/serialize/inline_container.h index 139174fa3d61e..621ffbe9a41ab 100644 --- a/caffe2/serialize/inline_container.h +++ b/caffe2/serialize/inline_container.h @@ -166,11 +166,7 @@ class TORCH_API PyTorchStreamWriter final { std::function writer_func_; // This number will be updated when the model has operators // that have valid upgraders. -#if ENABLE_UPGRADERS uint64_t version_ = kMinProducedFileFormatVersion; -#else - uint64_t version_ = kProducedFileFormatVersion; -#endif bool finalized_ = false; bool err_seen_ = false; friend size_t ostream_write_func( diff --git a/caffe2/serialize/versions.h b/caffe2/serialize/versions.h index 78a91c64fe84f..6e2c27adc8fae 100644 --- a/caffe2/serialize/versions.h +++ b/caffe2/serialize/versions.h @@ -4,18 +4,9 @@ namespace caffe2 { namespace serialize { -// Flag that controls if we want to enable upgraders -// in the server side. 
When this flag is set to False, -// it will switch to old dynamic versioning approach -#define ENABLE_UPGRADERS true - constexpr uint64_t kMinSupportedFileFormatVersion = 0x1L; -#if ENABLE_UPGRADERS constexpr uint64_t kMaxSupportedFileFormatVersion = 0xAL; -#else -constexpr uint64_t kMaxSupportedFileFormatVersion = 0x6L; -#endif // Versions (i.e. why was the version number bumped?) @@ -57,7 +48,6 @@ constexpr uint64_t kMaxSupportedFileFormatVersion = 0x6L; // when given bool or integer fill values. // 6. Write version string to `./data/version` instead of `version`. -#if ENABLE_UPGRADERS // [12/15/2021] // kProducedFileFormatVersion is set to 7 from 3 due to a different // interpretation of what file format version is. @@ -84,9 +74,6 @@ constexpr uint64_t kMaxSupportedFileFormatVersion = 0x6L; // and aten::gelu.out to support the new approximate kwarg. // (see: https://github.com/pytorch/pytorch/pull/61439) constexpr uint64_t kProducedFileFormatVersion = 0xAL; -#else -constexpr uint64_t kProducedFileFormatVersion = 0x3L; -#endif // Absolute minimum version we will write packages. This // means that every package from now on will always be diff --git a/caffe2/sgd/learning_rate_op.cc b/caffe2/sgd/learning_rate_op.cc index e8172ab65efea..7e6545c5adebd 100644 --- a/caffe2/sgd/learning_rate_op.cc +++ b/caffe2/sgd/learning_rate_op.cc @@ -134,6 +134,12 @@ Example usage: .Arg( "cosine_lr_shrink", "defaults to 0.99, part of CompositeCosineLRPolicy") + .Arg( + "num_iter_1", + "(int, default 0) number of iterations over which to warmup for slope policy") + .Arg( + "num_iter_2", + "(int, default 0) number of iterations over which to gradually gate for slope policy") .Input(0, "input", "description needed") .Output(0, "output", "description needed") .DeviceInferenceFunction([](const OperatorDef& def) { @@ -185,5 +191,7 @@ C10_EXPORT_CAFFE2_OP_TO_C10_CPU( "int? cosine_period = 50, " "float? cosine_t_mult = 1.0, " "float? cosine_lr_shrink = 0.99, " - "float? decay = 1.0) -> Tensor output", + "float? decay = 1.0, " + "int? num_iter_1 = 0, " + "int? num_iter_2 = 0) -> Tensor output", LearningRateOpFloatCPU); diff --git a/caffe2/utils/threadpool/ThreadPool.cc b/caffe2/utils/threadpool/ThreadPool.cc index 3f0a2adc233c5..cbccf0749bef1 100644 --- a/caffe2/utils/threadpool/ThreadPool.cc +++ b/caffe2/utils/threadpool/ThreadPool.cc @@ -100,6 +100,17 @@ size_t getDefaultNumThreads() { // Always give precedence to explicit setting. numThreads = FLAGS_pthreadpool_size; } + + /* + * For llvm-tsan, holding limit for the number of locks for a single thread + * is 64. pthreadpool's worst case is the number of threads in a pool. So we + * want to limit the threadpool size to 64 when running with tsan. However, + * sometimes it is tricky to detect if we are running under tsan, for now + * capping the default threadcount to the tsan limit unconditionally. 
+ */ + int tsanThreadLimit = 64; + numThreads = std::min(numThreads, tsanThreadLimit); + return numThreads; } diff --git a/cmake/Dependencies.cmake b/cmake/Dependencies.cmake index c67746d903dc1..0e96653967da6 100644 --- a/cmake/Dependencies.cmake +++ b/cmake/Dependencies.cmake @@ -823,12 +823,8 @@ if(USE_FBGEMM) set_property(TARGET fbgemm PROPERTY POSITION_INDEPENDENT_CODE ON) if("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang" AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 13.0.0) # See https://github.com/pytorch/pytorch/issues/74352 - target_compile_options(asmjit PRIVATE -Wno-deprecated-copy) - if(("${CMAKE_CXX_COMPILER_ID}" STREQUAL "AppleClang" AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER_EQUAL 13.1.6) - OR("${CMAKE_CXX_COMPILER_ID}" STREQUAL "Clang" AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER_EQUAL 13.0.0)) - # -Wno-unused-but-set-variable doesn't exist in Apple clang version 13.0.0 (clang-1300.0.29.30) - target_compile_options(asmjit PRIVATE -Wno-unused-but-set-variable) - endif() + target_compile_options_if_supported(asmjit -Wno-deprecated-copy) + target_compile_options_if_supported(asmjit -Wno-unused-but-set-variable) endif() endif() @@ -1443,6 +1439,11 @@ if(USE_GLOO) get_target_property(_include_dirs uv_a INCLUDE_DIRECTORIES) set_target_properties(uv_a PROPERTIES INTERFACE_INCLUDE_DIRECTORIES "${_include_dirs}") endif() + if(USE_NCCL AND NOT USE_SYSTEM_NCCL) + # Tell Gloo build system to use bundled NCCL, see + # https://github.com/facebookincubator/gloo/blob/950c0e23819779a9e0c70b861db4c52b31d1d1b2/cmake/Dependencies.cmake#L123 + set(NCCL_EXTERNAL ON) + endif() # gloo uses cuda_add_library torch_update_find_cuda_flags() add_subdirectory(${CMAKE_CURRENT_LIST_DIR}/../third_party/gloo) diff --git a/cmake/External/nccl.cmake b/cmake/External/nccl.cmake index 84c79c243b43a..2d3821840c179 100644 --- a/cmake/External/nccl.cmake +++ b/cmake/External/nccl.cmake @@ -15,36 +15,52 @@ if(NOT __NCCL_INCLUDED) # this second replacement is needed when there are multiple archs string(REPLACE ";-gencode" " -gencode" NVCC_GENCODE "${NVCC_GENCODE}") + if("${CMAKE_GENERATOR}" MATCHES "Make") + # Recursive make with jobserver for parallelism + set(MAKE_COMMAND "$(MAKE)") + else() + if(DEFINED ENV{MAX_JOBS}) + set(MAX_JOBS "$ENV{MAX_JOBS}") + else() + include(ProcessorCount) + ProcessorCount(NUM_HARDWARE_THREADS) + # Assume 2 hardware threads per cpu core + math(EXPR MAX_JOBS "${NUM_HARDWARE_THREADS} / 2") + endif() + + # Parallel build with CPU load limit to avoid oversubscription + set(MAKE_COMMAND "make" "-j${MAX_JOBS}" "-l${MAX_JOBS}") + endif() + set(__NCCL_BUILD_DIR "${CMAKE_CURRENT_BINARY_DIR}/nccl") ExternalProject_Add(nccl_external SOURCE_DIR ${PROJECT_SOURCE_DIR}/third_party/nccl/nccl BUILD_IN_SOURCE 1 CONFIGURE_COMMAND "" BUILD_COMMAND - env - # TODO: remove these flags when - # https://github.com/pytorch/pytorch/issues/13362 is fixed - "CCACHE_DISABLE=1" - "SCCACHE_DISABLE=1" - make + ${MAKE_COMMAND} "CXX=${CMAKE_CXX_COMPILER}" "CUDA_HOME=${CUDA_TOOLKIT_ROOT_DIR}" "NVCC=${CUDA_NVCC_EXECUTABLE}" "NVCC_GENCODE=${NVCC_GENCODE}" "BUILDDIR=${__NCCL_BUILD_DIR}" "VERBOSE=0" - "-j" - $ENV{MAX_JOBS} - BUILD_BYPRODUCTS "${__NCCL_BUILD_DIR}/lib/libnccl_static.a" + BUILD_BYPRODUCTS "${__NCCL_BUILD_DIR}/lib/libnccl_static.a" INSTALL_COMMAND "" ) # Detect objcopy version execute_process(COMMAND "${CMAKE_OBJCOPY}" "--version" OUTPUT_VARIABLE OBJCOPY_VERSION_STR) - string(REGEX REPLACE "GNU objcopy version ([0-9])\\.([0-9]+).*" "\\1" OBJCOPY_VERSION_MAJOR ${OBJCOPY_VERSION_STR}) - string(REGEX REPLACE 
"GNU objcopy version ([0-9])\\.([0-9]+).*" "\\2" OBJCOPY_VERSION_MINOR ${OBJCOPY_VERSION_STR}) + string(REGEX REPLACE "GNU objcopy .+ ([0-9])\\.([0-9]+).*" "\\1" OBJCOPY_VERSION_MAJOR ${OBJCOPY_VERSION_STR}) + string(REGEX REPLACE "GNU objcopy .+ ([0-9])\\.([0-9]+).*" "\\2" OBJCOPY_VERSION_MINOR ${OBJCOPY_VERSION_STR}) - if((${OBJCOPY_VERSION_MAJOR} GREATER 2) OR ((${OBJCOPY_VERSION_MAJOR} EQUAL 2) AND (${OBJCOPY_VERSION_MINOR} GREATER 27))) + # TODO: Replace me with SKIP_NCCL_SLIMMING option (and investigate why it does not work on newer compilers) + if("$ENV{BUILD_ENVIRONMENT}" MATCHES ".*-libtorch-cxx11-abi$") + # See https://github.com/pytorch/pytorch/issues/83887 + message(WARNING "Skip NCCL library slimming for cxx11-abi builds") + set(__NCCL_LIBRARY_DEP nccl_external) + set(NCCL_LIBRARIES ${__NCCL_BUILD_DIR}/lib/libnccl_static.a) + elseif((${OBJCOPY_VERSION_MAJOR} GREATER 2) OR ((${OBJCOPY_VERSION_MAJOR} EQUAL 2) AND (${OBJCOPY_VERSION_MINOR} GREATER 27))) message(WARNING "Enabling NCCL library slimming") add_custom_command( OUTPUT "${__NCCL_BUILD_DIR}/lib/libnccl_slim_static.a" @@ -53,7 +69,9 @@ if(NOT __NCCL_INCLUDED) COMMAND cd objects COMMAND "${CMAKE_AR}" x "${__NCCL_BUILD_DIR}/lib/libnccl_static.a" COMMAND for obj in all_gather_* all_reduce_* broadcast_* reduce_*.o$ do "${CMAKE_OBJCOPY}" --remove-relocations .nvFatBinSegment --remove-section __nv_relfatbin $$obj$ done - COMMAND "${CMAKE_AR}" cr "${__NCCL_BUILD_DIR}/lib/libnccl_slim_static.a" "*.o" + COMMAND "${CMAKE_AR}" cr "${__NCCL_BUILD_DIR}/lib/libnccl_slim_static.a" "*.o" + COMMAND "${CMAKE_AR}" xN 1 "${__NCCL_BUILD_DIR}/lib/libnccl_static.a" net.o + COMMAND "${CMAKE_AR}" q "${__NCCL_BUILD_DIR}/lib/libnccl_slim_static.a" net.o COMMAND cd - COMMAND "${CMAKE_COMMAND}" -E remove_directory "${__NCCL_BUILD_DIR}/objects" WORKING_DIRECTORY "${__NCCL_BUILD_DIR}" diff --git a/cmake/External/ucc.cmake b/cmake/External/ucc.cmake index 359ea67b1a745..70cdf4b3af2d5 100644 --- a/cmake/External/ucc.cmake +++ b/cmake/External/ucc.cmake @@ -2,19 +2,14 @@ if(NOT __UCC_INCLUDED) set(__UCC_INCLUDED TRUE) if(USE_SYSTEM_UCC) - set(UCX_HOME $ENV{UCX_HOME} CACHE PATH "UCX install directory") - set(UCC_HOME $ENV{UCC_HOME} CACHE PATH "UCC install directory") - - add_library(__caffe2_ucc INTERFACE) - - target_include_directories(__caffe2_ucc INTERFACE ${UCX_HOME}/include/) - target_include_directories(__caffe2_ucc INTERFACE ${UCC_HOME}/include/) - - target_link_libraries(__caffe2_ucc INTERFACE ${UCX_HOME}/lib/libucp.so) - target_link_libraries(__caffe2_ucc INTERFACE ${UCX_HOME}/lib/libucs.so) - target_link_libraries(__caffe2_ucc INTERFACE ${UCC_HOME}/lib/libucc.so) + find_package(UCC REQUIRED) + find_package(UCX REQUIRED) + if(UCC_FOUND AND UCX_FOUND) + add_library(__caffe2_ucc INTERFACE) + target_link_libraries(__caffe2_ucc INTERFACE ucx::ucs ucx::ucp ucc::ucc) + target_include_directories(__caffe2_ucc INTERFACE ${UCC_INCLUDE_DIRS}) + endif() else() message(FATAL_ERROR "USE_SYSTEM_UCC=OFF is not supported yet when using UCC") endif() - endif() diff --git a/cmake/public/LoadHIP.cmake b/cmake/public/LoadHIP.cmake index 87bb57da1543f..89a61b6242856 100644 --- a/cmake/public/LoadHIP.cmake +++ b/cmake/public/LoadHIP.cmake @@ -143,6 +143,9 @@ message("Building PyTorch for GPU arch: ${PYTORCH_ROCM_ARCH}") # Add HIP to the CMAKE Module Path set(CMAKE_MODULE_PATH ${HIP_PATH}/cmake ${CMAKE_MODULE_PATH}) +#Disable kernel assert due to performance regression +set(ROCM_ENABLE_KERNEL_ASSERTS FALSE CACHE BOOL "Kernel asserts are disabled by default for ROCm") + 
macro(find_package_and_print_version PACKAGE_NAME) find_package("${PACKAGE_NAME}" ${ARGN}) message("${PACKAGE_NAME} VERSION: ${${PACKAGE_NAME}_VERSION}") @@ -283,8 +286,18 @@ if(HIP_FOUND) find_package_and_print_version(hipcub REQUIRED) find_package_and_print_version(rocthrust REQUIRED) - # Disable Asserts In Code (Can't use asserts on HIP stack.) - add_definitions(-DNDEBUG) + if(ROCM_VERSION_DEV VERSION_GREATER_EQUAL "4.1.0") + if(ROCM_ENABLE_KERNEL_ASSERTS) + message("ROCm version >= 4.1; enabling asserts") + else() + add_definitions(-DROCM_DISABLE_GPU_ASSERTS) + message("ROCm version >= 4.1; kernel asserts are disabled") + endif() + else() + # Disable Asserts In Code (Can't use asserts on HIP stack.) + add_definitions(-DNDEBUG) + message("ROCm version < 4.1; disablng asserts") + endif() if(HIP_COMPILER STREQUAL clang) set(hip_library_name amdhip64) diff --git a/cmake/public/utils.cmake b/cmake/public/utils.cmake index 0daa6b7f6a3ef..b0c4cc6f08b56 100644 --- a/cmake/public/utils.cmake +++ b/cmake/public/utils.cmake @@ -451,7 +451,6 @@ function(torch_compile_options libname) -Wno-unused-parameter -Wno-unused-function -Wno-unused-result - -Wno-unused-local-typedefs -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas @@ -570,3 +569,26 @@ function(torch_update_find_cuda_flags) " CUDA_NVCC_FLAGS_MINSIZEREL = ${FLAGS_MINSIZEREL}") endif() endfunction() + +############################################################################## +# CHeck if given flag is supported and append it to provided outputvar +# Also define HAS_UPPER_CASE_FLAG_NAME variable +# Usage: +# append_cxx_flag_if_supported("-Werror" CMAKE_CXX_FLAGS) +function(append_cxx_flag_if_supported flag outputvar) + string(TOUPPER "HAS${flag}" _FLAG_NAME) + string(REGEX REPLACE "[=-]" "_" _FLAG_NAME "${_FLAG_NAME}") + check_cxx_compiler_flag("${flag}" ${_FLAG_NAME}) + if(${_FLAG_NAME}) + string(APPEND ${outputvar} " ${flag}") + set(${outputvar} "${${outputvar}}" PARENT_SCOPE) + endif() +endfunction() + +function(target_compile_options_if_supported target flag) + set(_compile_options "") + append_cxx_flag_if_supported("${flag}" _compile_options) + if(NOT "${_compile_options}" STREQUAL "") + target_compile_options(${target} PRIVATE ${flag}) + endif() +endfunction() diff --git a/defs_gpu.bzl b/defs_gpu.bzl index 3d6cae8830893..bfc3db8618629 100644 --- a/defs_gpu.bzl +++ b/defs_gpu.bzl @@ -71,9 +71,7 @@ ATEN_NATIVE_CUDA_H_PATTERN = [ ] # T66678203: Clang CUDA rollout -ATEN_CUDA_CLANG_CU_PATTERN = [ - "aten/src/ATen/native/cuda/DistributionBernoulli.cu", -] +ATEN_CUDA_CLANG_CU_PATTERN = [] ### Cuda Files def get_aten_cuda_headers(): diff --git a/docker.Makefile b/docker.Makefile index a1772529d926d..0768f6ecf6ed8 100644 --- a/docker.Makefile +++ b/docker.Makefile @@ -1,6 +1,6 @@ -DOCKER_REGISTRY = docker.io -DOCKER_ORG = $(shell docker info 2>/dev/null | sed '/Username:/!d;s/.* //') -DOCKER_IMAGE = pytorch +DOCKER_REGISTRY ?= docker.io +DOCKER_ORG ?= $(shell docker info 2>/dev/null | sed '/Username:/!d;s/.* //') +DOCKER_IMAGE ?= pytorch DOCKER_FULL_NAME = $(DOCKER_REGISTRY)/$(DOCKER_ORG)/$(DOCKER_IMAGE) ifeq ("$(DOCKER_ORG)","") @@ -8,7 +8,7 @@ $(warning WARNING: No docker user found using results from whoami) DOCKER_ORG = $(shell whoami) endif -CUDA_VERSION = 11.3 +CUDA_VERSION = 11.3.1 CUDNN_VERSION = 8 BASE_RUNTIME = ubuntu:18.04 BASE_DEVEL = nvidia/cuda:$(CUDA_VERSION)-cudnn$(CUDNN_VERSION)-devel-ubuntu18.04 @@ -16,13 +16,13 @@ BASE_DEVEL = nvidia/cuda:$(CUDA_VERSION)-cudnn$(CUDNN_VERSION)-de # The conda channel to 
use to install cudatoolkit CUDA_CHANNEL = nvidia # The conda channel to use to install pytorch / torchvision -INSTALL_CHANNEL = pytorch +INSTALL_CHANNEL ?= pytorch -PYTHON_VERSION = 3.8 -PYTORCH_VERSION = $(shell git describe --tags --always) +PYTHON_VERSION ?= 3.8 +PYTORCH_VERSION ?= $(shell git describe --tags --always) # Can be either official / dev -BUILD_TYPE = dev -BUILD_PROGRESS = auto +BUILD_TYPE ?= dev +BUILD_PROGRESS ?= auto BUILD_ARGS = --build-arg BASE_IMAGE=$(BASE_IMAGE) \ --build-arg PYTHON_VERSION=$(PYTHON_VERSION) \ --build-arg CUDA_VERSION=$(CUDA_VERSION) \ @@ -30,10 +30,32 @@ BUILD_ARGS = --build-arg BASE_IMAGE=$(BASE_IMAGE) \ --build-arg PYTORCH_VERSION=$(PYTORCH_VERSION) \ --build-arg INSTALL_CHANNEL=$(INSTALL_CHANNEL) EXTRA_DOCKER_BUILD_FLAGS ?= + +BUILD ?= build +# Intentionally left blank +PLATFORMS_FLAG ?= +PUSH_FLAG ?= +USE_BUILDX ?= +BUILD_PLATFORMS ?= +WITH_PUSH ?= false +# Setup buildx flags +ifneq ("$(USE_BUILDX)","") +BUILD = buildx build +ifneq ("$(BUILD_PLATFORMS)","") +PLATFORMS_FLAG = --platform="$(BUILD_PLATFORMS)" +endif +# Only set platforms flags if using buildx +ifeq ("$(WITH_PUSH)","true") +PUSH_FLAG = --push +endif +endif + DOCKER_BUILD = DOCKER_BUILDKIT=1 \ - docker build \ + docker $(BUILD) \ --progress=$(BUILD_PROGRESS) \ $(EXTRA_DOCKER_BUILD_FLAGS) \ + $(PLATFORMS_FLAG) \ + $(PUSH_FLAG) \ --target $(BUILD_TYPE) \ -t $(DOCKER_FULL_NAME):$(DOCKER_TAG) \ $(BUILD_ARGS) . @@ -48,7 +70,7 @@ devel-image: DOCKER_TAG := $(PYTORCH_VERSION)-devel devel-image: $(DOCKER_BUILD) -.PHONY: devel-image +.PHONY: devel-push devel-push: BASE_IMAGE := $(BASE_DEVEL) devel-push: DOCKER_TAG := $(PYTORCH_VERSION)-devel devel-push: @@ -59,9 +81,8 @@ runtime-image: BASE_IMAGE := $(BASE_RUNTIME) runtime-image: DOCKER_TAG := $(PYTORCH_VERSION)-runtime runtime-image: $(DOCKER_BUILD) - docker tag $(DOCKER_FULL_NAME):$(DOCKER_TAG) $(DOCKER_FULL_NAME):latest -.PHONY: runtime-image +.PHONY: runtime-push runtime-push: BASE_IMAGE := $(BASE_RUNTIME) runtime-push: DOCKER_TAG := $(PYTORCH_VERSION)-runtime runtime-push: diff --git a/docs/requirements.txt b/docs/requirements.txt index 9a967dd54e0ff..14c93adc22e90 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,9 +1,12 @@ sphinx==5.0.0 -e git+https://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme -sphinxcontrib.katex -matplotlib -tensorboard +# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering +# but it doesn't seem to work and hangs around idly. The initial thought is probably +# something related to Docker setup. We can investigate this later +sphinxcontrib.katex==0.8.6 +matplotlib==3.5.3 +tensorboard==2.10.0 # required to build torch.distributed.elastic.rendezvous.etcd* docs -python-etcd>=0.4.5 -sphinx_copybutton -sphinx-panels +python-etcd==0.4.5 +sphinx-copybutton==0.5.0 +sphinx-panels==0.4.1 diff --git a/docs/source/amp.rst b/docs/source/amp.rst index 0785849c579e2..3c0c77d4bc4f1 100644 --- a/docs/source/amp.rst +++ b/docs/source/amp.rst @@ -26,7 +26,7 @@ However, :class:`torch.autocast` and :class:`torch.cuda.amp.GradScaler` are modu As shown in the CPU example section of :class:`torch.autocast`, "automatic mixed precision training/inference" on CPU with datatype of ``torch.bfloat16`` only uses :class:`torch.autocast`. -For CUDA and CPU, APIs are also provided seperately: +For CUDA and CPU, APIs are also provided separately: * ``torch.autocast("cuda", args...)`` is equivalent to ``torch.cuda.amp.autocast(args...)``. 
* ``torch.autocast("cpu", args...)`` is equivalent to ``torch.cpu.amp.autocast(args...)``. For CPU, only lower precision floating point datatype of ``torch.bfloat16`` is supported for now. diff --git a/docs/source/backends.rst b/docs/source/backends.rst index 152e0144a416d..ffbc8a99081a5 100644 --- a/docs/source/backends.rst +++ b/docs/source/backends.rst @@ -11,6 +11,7 @@ These backends include: - ``torch.backends.cuda`` - ``torch.backends.cudnn`` +- ``torch.backends.mps`` - ``torch.backends.mkl`` - ``torch.backends.mkldnn`` - ``torch.backends.openmp`` diff --git a/docs/source/community/governance.rst b/docs/source/community/governance.rst index 0a7c224256073..cbb8576c89a4d 100644 --- a/docs/source/community/governance.rst +++ b/docs/source/community/governance.rst @@ -60,18 +60,15 @@ design docs, any disputes and dispute resolutions) so that contributors and other interested parties understand the future direction of the project and can participate in discussion. -Within `pytorch/pytorch `__, -maintainer groups are defined in the -`CODEOWNERS `__ -file in the GitHub repository. For other modules that correspond -to repositories, membership is recorded on GitHub as access -level to the repo (i.e. “write” permission). Module maintainers -are given privileges to administrate the repository (except for -`pytorch/pytorch `__ where -they are responsible for a folder). +Responsibilities of the maintainer includes: +* Triaging high priority issues of the module +* Triaging and reviewing and landing high priority pull requests of the module +* Supporting public documentation related to the module +* Running public developer meetings Core Maintainers ---------------- + The core maintainers are expected to have a deep understanding of the PyTorch code base and design philosophies. Their responsibilities include: @@ -130,14 +127,12 @@ The Principles The Process for Nomination ~~~~~~~~~~~~~~~~~~~~~~~~~~ -* We will have a nomination form, where anyone in the community can - nominate a person to a Module maintainer position -* Every 3 months, the core maintainers go through the nominations, - do light filtering around spam or desk-rejection, and draw up a - list of potential nominees. -* The core maintainers ask the specific module maintainers for more - information on the nominee. The information should include the following - items: +* Each module has its own process. Please contact module maintainers for more information. + However, if there is no process identified, you can file a request to the core maintainers + by submitting [this form](https://forms.gle/xNeu1byGMZVHcA2q7). Core maintainers are + meeting every three months. 
diff --git a/docs/source/community/governance.rst b/docs/source/community/governance.rst
index 0a7c224256073..cbb8576c89a4d 100644
--- a/docs/source/community/governance.rst
+++ b/docs/source/community/governance.rst
@@ -60,18 +60,15 @@ design docs, any disputes and dispute resolutions) so that contributors
 and other interested parties understand the future direction of the project and can participate in discussion.

-Within `pytorch/pytorch `__,
-maintainer groups are defined in the
-`CODEOWNERS `__
-file in the GitHub repository. For other modules that correspond
-to repositories, membership is recorded on GitHub as access
-level to the repo (i.e. “write” permission). Module maintainers
-are given privileges to administrate the repository (except for
-`pytorch/pytorch `__ where
-they are responsible for a folder).
+Responsibilities of the maintainer includes:
+* Triaging high priority issues of the module
+* Triaging and reviewing and landing high priority pull requests of the module
+* Supporting public documentation related to the module
+* Running public developer meetings

 Core Maintainers
 ----------------
+
 The core maintainers are expected to have a deep understanding of the PyTorch code base and design philosophies. Their responsibilities include:
@@ -130,14 +127,12 @@ The Principles
 The Process for Nomination
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

-* We will have a nomination form, where anyone in the community can
-  nominate a person to a Module maintainer position
-* Every 3 months, the core maintainers go through the nominations,
-  do light filtering around spam or desk-rejection, and draw up a
-  list of potential nominees.
-* The core maintainers ask the specific module maintainers for more
-  information on the nominee. The information should include the following
-  items:
+* Each module has its own process. Please contact module maintainers for more information.
+  However, if there is no process identified, you can file a request to the core maintainers
+  by submitting [this form](https://forms.gle/xNeu1byGMZVHcA2q7). Core maintainers are
+  meeting every three months.
+* If you are submitting a request to the core maintainers, the information in your request
+  must include the following items:

 * The nominees depth and breadth of code, review and design contributions on the module
diff --git a/docs/source/community/persons_of_interest.rst b/docs/source/community/persons_of_interest.rst
index f6e19db5e8255..cbe5cb1462128 100644
--- a/docs/source/community/persons_of_interest.rst
+++ b/docs/source/community/persons_of_interest.rst
@@ -128,6 +128,14 @@ NVIDIA / CUDA
 - Piotr Bialecki (`ptrblck `__)
 - (emeritus) Xiaoqiang Zheng (`zheng-xq `__)

+NVFuser
+~~~~~~~
+
+- Christian Sarofeen (`csarofeen `__)
+- Alex Jann (`jjsjann123 `__)
+- Piotr Bialecki (`ptrblck `__)
+- Natalia Gimelshein (`ngimel `__)
+
 Intel / MKLDNN
 ~~~~~~~~~~~~~~

@@ -182,10 +190,11 @@ C10 utils and operator dispatch
 - Dmytro Dzhulgakov (`dzhulgakov `__)
 - (emeritus) Sebastian Messmer (`smessmer `__)

-PyTorch -> ONNX
-~~~~~~~~~~~~~~~
+ONNX exporter
+~~~~~~~~~~~~~
 - Bowen Bao (`BowenBao `__)
-- Gary Miguel (`garymm `__)
+- Aaron Bockover (`abock `__)
+- (emeritus) Gary Miguel (`garymm `__)
 - (emeritus) Lara Haidar (`lara-hdr `__)
 - (emeritus) Lu Fang (`houseroad `__)
 - (emeritus) Negin Raoof (`neginraoof `__)
@@ -220,6 +229,7 @@ Apple M1/MPS
 - Alban Desmaison (`alband `__)
 - Nikita Shulga (`malfet `__)
 - Kulin Seth (`kulinseth `__)
+- Ramin Azarmehr (`razarmehr `__)

 PowerPC
 ~~~~~~~
diff --git a/docs/source/conf.py b/docs/source/conf.py
index e8b683cd445cd..098cc3ff61ef9 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -134,8 +134,6 @@
     "unregister_custom_op_symbolic",
     # torch.ao.quantization
     "default_eval_fn",
-    # torch.ao.quantization.backend_config
-    "validate_backend_config_dict",
     # torch.backends
     "disable_global_flags",
     "flags_frozen",
@@ -189,7 +187,10 @@
     "DeserializationStorageContext",
     "DeviceObjType",
     "DictType",
+    "DispatchKey",
+    "DispatchKeySet",
     "EnumType",
+    "ExcludeDispatchKeyGuard",
     "ExecutionPlan",
     "FileCheck",
     "FloatType",
@@ -316,7 +317,7 @@
     "DDPCommHookType",
     # torch.jit.mobile
     "LiteScriptModule",
-    # torch.nn.quantized.modules
+    # torch.ao.nn.quantized.modules
     "DeQuantize",
     "Quantize",
     # torch.utils.backcompat
@@ -492,6 +493,51 @@ def is_not_internal(modname):
         for o in output:
             f.write(o)

+
+def process_docstring(app, what_, name, obj, options, lines):
+    """
+    Custom process to transform docstring lines Remove "Ignore" blocks
+
+    Args:
+        app (sphinx.application.Sphinx): the Sphinx application object
+
+        what (str):
+            the type of the object which the docstring belongs to (one of
+            "module", "class", "exception", "function", "method", "attribute")
+
+        name (str): the fully qualified name of the object
+
+        obj: the object itself
+
+        options: the options given to the directive: an object with
+            attributes inherited_members, undoc_members, show_inheritance
+            and noindex that are true if the flag option of same name was
+            given to the auto directive
+
+        lines (List[str]): the lines of the docstring, see above
+
+    References:
+        https://www.sphinx-doc.org/en/1.5.1/_modules/sphinx/ext/autodoc.html
+        https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html
+    """
+    import re
+    remove_directives = [
+        # Remove all xdoctest directives
+        re.compile(r'\s*>>>\s*#\s*x?doctest:\s*.*'),
+        re.compile(r'\s*>>>\s*#\s*x?doc:\s*.*'),
+    ]
+    filtered_lines = [
+        line for line in lines
+        if not any(pat.match(line) for pat in remove_directives)
+    ]
+    # Modify the lines inplace
+    lines[:] = filtered_lines
+
+    # make sure there is a blank line at the end
+    if lines and lines[-1].strip():
+        lines.append('')
+
+
 # Called automatically by Sphinx, making this `conf.py` an "extension".
 def setup(app):
     # NOTE: in Sphinx 1.8+ `html_css_files` is an official configuration value
@@ -508,6 +554,7 @@ def setup(app):
         add_css(css_file)

     app.connect("build-finished", coverage_post_process)
+    app.connect('autodoc-process-docstring', process_docstring)

 # From PyTorch 1.5, we now use autogenerated files to document classes and
 # functions. This breaks older references since
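Illustrative aside (not part of the diff above): a standalone sketch of the filtering that the new `process_docstring` hook in `conf.py` performs, applied to a made-up docstring.

```
import re

# Same patterns as the hook above: drop xdoctest directive lines.
remove_directives = [
    re.compile(r'\s*>>>\s*#\s*x?doctest:\s*.*'),
    re.compile(r'\s*>>>\s*#\s*x?doc:\s*.*'),
]

docstring_lines = [
    "Example:",
    "    >>> # xdoctest: +SKIP",
    "    >>> torch.ones(2)",
    "    tensor([1., 1.])",
]

filtered = [
    line for line in docstring_lines
    if not any(pat.match(line) for pat in remove_directives)
]
print(filtered)
# ['Example:', '    >>> torch.ones(2)', '    tensor([1., 1.])']
```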
diff --git a/docs/source/cuda.rst b/docs/source/cuda.rst
index 361b60ed546c8..02c3b407aa218 100644
--- a/docs/source/cuda.rst
+++ b/docs/source/cuda.rst
@@ -33,6 +33,7 @@ torch.cuda
     stream
     synchronize
     utilization
+    OutOfMemoryError

 Random Number Generator
 -------------------------
diff --git a/docs/source/elastic/timer.rst b/docs/source/elastic/timer.rst
index e9d4228ee7a6a..f64597c4ce2bf 100644
--- a/docs/source/elastic/timer.rst
+++ b/docs/source/elastic/timer.rst
@@ -18,10 +18,21 @@ Below are the timer server and client pairs that are provided by torchelastic.
    in pairs since there is a messaging protocol between the server and client.

+Below is a pair of timer server and client that is implemented based on
+a ``multiprocess.Queue``.
+
 .. autoclass:: LocalTimerServer

 .. autoclass:: LocalTimerClient

+Below is another pair of timer server and client that is implemented
+based on a named pipe.
+
+.. autoclass:: FileTimerServer
+
+.. autoclass:: FileTimerClient
+
+
 Writing a custom timer server/client
 --------------------------------------
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 4e069c9279a20..f688cbe0134fd 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -91,6 +91,7 @@ Features described in this documentation are classified by release status:
    quantization
    rpc
    torch.random
+   masked
    nested
    sparse
    storage
diff --git a/docs/source/masked.rst b/docs/source/masked.rst
new file mode 100644
index 0000000000000..e70f7b04c1ceb
--- /dev/null
+++ b/docs/source/masked.rst
@@ -0,0 +1,11 @@
+torch.masked
+============
+
+.. automodule:: torch.masked
+.. automodule:: torch.masked.maskedtensor
+
+Introduction
+++++++++++++
+
+WIP. For more information, you can go to github.com/pytorch/maskedtensor for the source code
+or http://pytorch.org/maskedtensor for a number of tutorials
diff --git a/docs/source/notes/cuda.rst b/docs/source/notes/cuda.rst
index c678844edcfaa..ed2d22a657d75 100644
--- a/docs/source/notes/cuda.rst
+++ b/docs/source/notes/cuda.rst
@@ -355,7 +355,7 @@ Use of a caching allocator can interfere with memory checking tools such as
 The behavior of caching allocator can be controlled via environment variable ``PYTORCH_CUDA_ALLOC_CONF``.
-The format is ``PYTORCH_CUDA_ALLOC_CONF=